Preface 


System Modeling: Why and Why Now? 


Vipul Periwal, Zoltan Szallasi and Jörg Stelling 





Introduction 


Biology is the study of self-replicating chemical processes. Biology is the study of 
systems accurately transmitting a genetic blueprint. Biology is the study of complex 
adaptive reproducing systems. 

What is systems biology if all definitions of biology implicitly or explicitly refer to 
the study of a whole object, whether it is a virus, a cell, a bacterium, a protozoan 
or a metazoan? We treat systems biology as the quantitative study of biological 
systems, aided (or hindered) by technological advances that both permit molecular 
observations on far more inclusive scales than possible even 15 years ago, and permit 
computational analysis of such observations. Thus, for the purposes of this book, 
systems biology is the promise of biology on a larger and quantitatively rigorous 
scale, a marriage of molecular biology and physiology. Concretely, this defines the 
focus of the book: data-centric quantitative modeling of biological processes and 
systems. 

Biology is an experimentally driven science simply because evolutionary processes 
are not understood well enough to allow theoretical advances to rest on terra firma. 
Systems biology is experimentally driven, computationally driven, and knowledge 
driven. It is experimentally driven because the complexity of biological systems is 
difficult to penetrate without large-scale coverage of the molecular underpinnings; 
it is computationally driven because the data obtained from experimental investi- 
gations of complex systems need extensive quantitative analysis to be informative; 
and it is knowledge driven because it is not computationally feasible to analyze the 
data without incorporating all that is already known about the biology in question. 
Furthermore, the use of data, computation and knowledge must be concurrent. 
Available knowledge guides experiment design, novel knowledge is generated by the 
computational analysis of new data in light of available knowledge, and the cycle 
repeats. 

The difference between knowledge and data is central to understanding the 
underpinnings of systems biology. The sequencing of whole genomes is a good 
example. Any given genome is data. Without extensive analysis, it is just as 
uninformative about biological processes as a photograph of the night sky. First 
steps in transforming a genome into knowledge include identifying genes, identifying 
transcription factor binding sites, finding the transcription factor complexes that 
control the expression of the genes, and finding the chromatin structure in the 
cell being studied, to determine which genes are accessible for transcription. While 
this is wildly optimistic in terms of the knowledge that can be extracted from the 
genome data, it is still nowhere close to the level of understanding required to make 
predictions about the response of an organism to a specific stimulus. A reductionist 
approach to biology is bootless because complex adaptive systems are inherently 
nonlinear, so their behavior is well summarized by the statement: the whole is more 
than the sum of the components. 





Handicapping the Bout 


From a quantitative perspective, there are striking features of biological dynamics 
that make analysis challenging: 

1. Large range of spatial scales 

2. Large range of temporal scales 

3. A lack of separation between responses to external stimuli versus internal 
programs 

4. Multiple functionalities of constituents 

5. Multiple levels of signal processing 

6. Incomplete evolutionary record 

7. Wide range of sensitivities to perturbations 

8. Genotypic variation 


None of these challenges is an absolute barrier to progress. Nevertheless, these 
challenges must be addressed to make real progress. 

From an experimental perspective, the challenges of biology are better understood: 

Coverage in terms of components and interactions 
Reproducibility 
Spatial resolution 
Temporal resolution 
Cross-validation 
Combinatorial perturbations 
Accuracy 

From a knowledge perspective, there are four central problems: 


1. Find an appropriate level of abstraction for a given analytic problem. 


2. Find a common basis to relate knowledge gained using different experimental 
techniques on the same system. 


3. Find a common basis to relate knowledge gained from the same experiment on 
different model systems. 


4. Incorporate knowledge incrementally as new data is analyzed. 


Taking all these difficulties together, it is not surprising that researchers tradi- 
tionally have considered the study of biological systems rather resistant to quan- 
titative approaches. It is, therefore, worth pointing out to skeptics that in some 
cases thorough quantitative analysis has produced insights into or explanations of 
biological phenomena that would have been impossible without the application of 
advanced mathematical tools. Various chapters in this book discuss a great variety 
of examples, many of them counterintuitive. For instance, the advantages of a more 
extensive mathematical analysis over simpler approaches are emphasized in 
chapter 8 (pp. 170-173). When circadian oscillators are analyzed by formal logic, the 
traditional analytical tool of molecular biology, or by macroscopic descriptors such 
as differential equations, the experimentally observed behavior cannot be recon- 
structed from the molecular machinery. Stochastic analysis, however, demonstrates 
how, by random fluctuations, the system escapes the macroscopic point-attractor 
and thus oscillatory behavior is maintained. Examples such as this will probably 
contribute to the long-awaited common ground for discussions between biologists 
and quantitative scientists. The mutual suspicion on both sides, which has been 
difficult to overcome by intellectual curiosity alone, will probably be eliminated by 
the mutual need for each other’s expertise. 





“My Complications Had Complications” 


The goal of systems biology is a predictive understanding of the whole. If the 
whole is more than the sum of its parts, it follows that acquiring a catalog of 
all the parts is not necessarily the first order of business. In a caricature, there 
are two avenues of attack possible: either one focuses on subsystems governing a 
specific function in arbitrary conditions and gains a predictive understanding of the 
system, one subsystem at a time, or one focuses on the system in a restricted set of 
conditions and gains an understanding by gradually increasing the set of conditions 
and, as required, the level of detail in the model of the system. The analogy is with 
molecular biology in the former approach and with physiology in the latter. 

The modeling associated with each approach is distinct. In the molecular biology 
type approach, the aim is to go beyond traditional pathway-centric points of 
view and deal with the challenges of feedback loops formed either directly or 
indirectly due to interactions with other pathways. In the physiology type approach, 
interactions between the components in the model are added as needed to maintain 
contact with the experimental data. The components in this approach are not 
necessarily directly related to biochemical species. Eventually, these bottom-up 
and top-down approaches should meet. However, each has its own strengths and 
weaknesses and they complement each other. 





Why Read This? 


The importance of feedback loops and crosstalk in almost all facets of biological 
systems has been apparent for several decades. The cell cycle control circuitry or 
the developmental programs in bilaterians are prime examples of this. The ability 
of cancerous cells to evade targeted therapies results largely from biological systems 
having evolved in ways that place a premium on robustness and adaptability. 
Such properties, as yet only nebulously defined, are not localizable to a small 
set of interactions. They reside in the network as a whole, as has been clearly 
demonstrated in predictions on metabolic networks. 

Modeling biological systems faces the challenge of appropriate abstractions— 
levels on which to focus, and details to be left out. For instance, molecular biology 
abounds with mechanistic analogies, but on a more detailed level often the un- 
derlying interactions are driven by chemistry. This makes modeling subtle since 
statistical biases are often the driving force in what superficially appears to be a 
mechanical process, for example, chemotaxis. At what level does such detail be- 
come relevant, and at what level can one ignore it? This is not a priori obvious, 
and one needs rigorous approaches to model parsimony to answer such questions. 
Indeed, the answer to the model selection question depends to a great extent on 
the predictions required. This is an important point in all biological modeling: The 
model, its purpose, and the experimental data are intimately related. A model that 
predicts hepatic glucose uptake precisely but insulin levels with greater uncertainty 
is not a useful selection if the only measurement available is insulin levels. 

There are two main approaches to computational analysis of biological data. 
The causal approach makes concrete deterministic or stochastic models (differ- 
ential equations, stochastic differential equations, Boolean networks, et cetera) of 
biological processes. The probabilistic view is associated with probabilistic infer- 
ence approaches, using pattern recognition or learning algorithms (such as neural 
networks and graphical models) for analysis of data from large-scale experimen- 
tal methods. These two approaches rest on a large part of applied mathematics 
(including numerical integration, optimization, interpolation, and control theory) 
and computer science (search theory, coding theory, and database design). This 
breadth necessitates collaborations between people with diverse backgrounds, but 
an inadequate understanding of the limitations and applicability of techniques and 
concepts from different fields hinders such collaborations. The background infor- 
mation required makes biological modeling a difficult task, but the real challenge 
remains that of making computational models effective and efficient representations 
of biological systems. 





What’s Included and What’s Not 


This book starts with generalities and progresses towards practicalities. Thus, the 
first section is conceptual, with attempts to define the role of modeling in biology, 
as well as attempts to cut through the miasma that surrounds the use of the terms 
robust, complex, adaptive, and module in the systems biology literature. As will be 
evident, these are important notions that need much further work to crystallize to 
the point where they can be assigned the honorific concept. Nevertheless, these terms 
may ultimately be quantitatively used as concrete guiding principles in modeling. 

The next section provides introductions to general approaches to making models 
of biology: qualitative models, constraint-based models, dynamical systems based 
on differential equations, and stochastic models, as well as models with spatial struc- 
ture. The other side of the modeling coin, probabilistic inference aimed at inference 
from large-scale data sets, is also introduced. The section proceeds from relatively 
simple towards mathematically more demanding approaches. Although each chap- 
ter tries to convey its central messages in an intuitive as well as in a mathematically 
rigorous way, readers arriving from biology will have to realize that each method 
has a certain minimum difficulty level associated with it. While ordinary differential 
equation-based or qualitative models can be quite readily introduced in an intuitive 
manner, stochastic or spatial modeling cannot be described in simple terms and 
requires an appropriate level of background in quantitative sciences. Key applications 
of the various modeling approaches are also widely covered. Taken together, this 
section will provide the reader with an overall impression of the relationship between 
the potential utility of quantitative approaches and their associated analytical cost. 

Reality bites. And models model biological reality. The section that follows next 
contains introductions to the data that is available for systems biology and the 
caveats that go with the data. It also contains introductions to inferring model 
architecture from data, using control theory in models, and studying synthetic gene 
networks. The antidote to these computational limitations is multi-level modeling, 
and this is also introduced in this section. Limitations in observability, accuracy, 
and coverage of biological data are widely recognized. One of the goals of this 
section is to guide the readers through various data interpretation methods while 
emphasizing what the data will or will not allow in terms of quantitative analysis. 

The last section of the book contains the computational issues and techniques for 
practical application of the preceding approaches: numerical methods for simulating 
biochemical systems, and the software infrastructure for representing models in a 
reusable and exchangeable manner. Biological data quality is not the only obstacle 
systems biology is facing. The various numerical methods also have their well 
known strengths and limitations and these should be considered when designing 
experiments and their associated models. For instance, computational limitations 
form barriers to increasing model size arbitrarily. 

The book ends with an eclectic list of the software tools that the contributing 
authors of this book find useful. 

While this book contains a plethora of approaches to biological modeling, we are 
keenly aware that there are many that we have not covered. For instance, we have 
eschewed much discussion of pattern recognition because this is only really useful 
when combined with domain specific biological knowledge—for which no general 
technique exists. Likewise, we do not cover approaches such as neural networks 
or Petri nets that have either limited application in systems biology so far, or are 
problematic regarding model interpretation. Our attempt has been to provide broad 
basic coverage of fundamental approaches and techniques. In our view, picking some 
of the techniques introduced in this book and combining them artfully leads to 
almost complete coverage of modeling in systems biology. 





Enjoy 


Systems biology is an approach to quantitatively understand biological systems 
that attempts to embrace the complexity of life as a fact of life. There is no 
hope of understanding biological systems at the predictive level required for disease 
detection, prevention, or cure other than by this means. Nevertheless, it would serve 
us well to temper Burnham’s maxim of grand thinking, “Make no little plans ...” 
with the story of the emperor’s new clothes. 





I GENERAL CONCEPTS 


1 The Role of Modeling in Systems Biology 


Douglas B. Kell and Joshua D. Knowles 


The use of models in biology is at once both familiar and arcane. It is familiar 
because, as we shall argue, biologists presently and regularly use models as ab- 
stractions of reality: diagrams, laws, graphs, plots, relationships, chemical formulae 
and so on are all essentially models of some external reality that we are trying to 
describe and understand (fig. 1.1). In the same way we use and speak of “model 
organisms” such as baker’s yeast or Arabidopsis thaliana, whose role lies in being 
similar to many organisms without being the same as any other one. Indeed, our 
theories and hypotheses about biological objects and systems are in one sense also 
just models (Vayttaden et al., 2004). Yet the use of models is for most biologists 
arcane because familiarity with a subset of model types, especially quantitative 
mathematical models, has lain outside the mainstream during the last 50 years of 
the purposely reductionist and qualitative era of molecular biology. It is largely 
these types of model that are an integral part of the “new” (and not-so-new) sys- 
tems biology and on which much of the rest of this book concentrates. Since all 
such models are developed for some kind of a purpose, our role in part is to explain 
why this type of mathematical model is both useful and important, and will likely 
become part of the standard armory of successful biologists. 





1.1 Philosophical Overview 


When one admits that nothing is certain one must, I think, also admit that some 
things are much more nearly certain than others. 


Bertrand Russell, Am I an Atheist or an Agnostic? 


It is conventional to discriminate (as in fig. 1.2) (a) the world of ideas, thoughts, 
or other mental constructs and (b) the world of observations or data, and most 
scientists would recognize that they are linked in an iterative cycle, as drawn: we 
improve our mental picture of the world by carrying out experiments that produce 
data, and such data are used to inform the cogitations that feed into the next part 
of the right-hand arc, that designs and performs the next set of experiments as 
part of an experimental program. Such a cycle may be seen as a “chicken and egg” 
cycle, but for any individual turn of the cycle there is a clear distinction between 
the two essential starting points (ideas or data). This also occurs in scientific 
funding circles: is the activity in question ideas- (that is, hypothesis-)driven or 
is it data-driven? (Until recently, the latter, hypothesis-generating approach was 
usually treated rather scornfully.) 

Figure 1.1 Models in biology. Although we shall be concentrating here on a subset 
of mathematical models, we would stress that the use of all sorts of models is entirely 
commonplace in biology; examples include (a) diagrams (here a sequence of DNA bases 
and the “central dogma”), (b) laws (the flux-control summation theorem of metabolic 
control analysis), (c) graphs, in the mathematical sense of elements with nodes and 
edges (a biochemical pathway), (d) plots (covariation of 2 metabolites in a series of 
experiments), (e) relationships (a rule describing the use of the concentration of a 
metabolite in disease diagnosis), (f) chemical formulae (tryptophan), and (g) images 
(of mammalian cells). 

From a philosophical point of view, then, the hypothetico-deductive analysis, in 
which an idea is the starting point (however muddled or wrongheaded that idea 
may be), has been seen as much more secure, since deductive reasoning is sound 
in the sense that if an axiom is true (as it is supposed to be by definition) and 
the observation is true, we can conclude that the facts are at least consistent with 
the idea. If the hypothesis is “all swans are white” then the prediction is that a 
measurement of the whiteness of known swans will give a positive response. By 
contrast, it has been known since the time of Hume that inductive reasoning, by 
which we seek to generalize from examples (“swan A is white, swan B is white, 
swan C is white ... so I predict that all swans are white”) is insecure, and a 
single black swan shows it. Nothing will ever change that, and the “problem of 
induction” probably lies at the heart of Popper’s insistence (see Popper (1992) 
and more readable commentators such as Medawar (1982)) that theories can only 
be disproved. Note of course that it is equally true for the hypothetico-deductive 
mode of reasoning that a single black swan will disprove the hypothesis. This said, 
the ability of scientists to ignore any number of ugly facts that would otherwise 
slay a beautiful hypothesis is well known (Gilbert and Mulkay, 1984), and in this 
sense, given that there are no genuinely secure axioms (Hofstadter, 1979; Nagel and 
Newman, 2002), the deductive mode of reasoning is not truly much more secure 
than is induction. 

Figure 1.2 The iterative relationship between the world of ideas/hypotheses/thoughts 
and the world of data/observations. Note that these are linked in a cycle, in which one 
arc is not simply the reverse of the other (Kell, 2002, 2005; Kell and Welch, 1991). 

Happily, there is emerging a more balanced view of the world. This recognizes 
that for working scientists the reductionist and ostensibly solely hypothesis-driven 
agenda has not been as fruitful as had been expected. In large measure in biology 
this realization has been driven by the recognition, following the systematic genome 
sequencing programs, that the existence, let alone the function, of many or most 
genes—even in well-worked model organisms—had not been recorded. This could be 
seen in part as a failure of the reductionist agenda. In addition there are many areas 
of scientific activity that have nothing to do with testing hypotheses but which are 
exceptionally important (Kell and Oliver, 2004); perhaps chief among these is the 
development of novel methods. In particular there are fields—functional genomics 
not least among them (Kell and King, 2000), although this is very true for many 
areas of medicine as well—that are data-rich but hypothesis-poor, and are best 
attacked using methods that are data-driven and thus essentially inductive (Kell 
and King, 2000). 

A second feature that has emerged from a Popperian view of the world (or 
at least from his attempt to find a means that would allow one to discriminate 
“science” from “pseudo-science” (Medawar, 1982; Popper, 1992)) is the intellectual 
significance of prediction: if your hypothesis makes an experimentally testable (and 
thus falsifiable) prediction it counts as “science,” and if the experimental result 
is consistent with the prediction then (confidence in) the “correctness” of your 
hypothesis or worldview is bolstered (see also Lipton (2005)). 





1.2 Historical Context 


The history of science demonstrates that both inductive and deductive reasoning 
occur at different stages in the development of ideas. In some cases, such as in the 
history of chemistry, a period of almost purely inductive reasoning (stamp-collecting 
and classification) is followed by the development of more powerful theories that 
seek to explain and predict many phenomena from more general principles. Often 
these theories are reductionist, that is to say, complicated phenomena that seem 
to elude coherent explanation are understood by some form of breaking down into 
constituent parts, the consideration of which yields the required explanation of 
the more complicated system. A prime example of the reductionist mode is the 
explanation of the macroscopic properties of solids, liquids, and gases (such as 
their temperature, pressure, and heat) by considering the average effect of a 
large number of microscopic interactions between particles, governed by Newtonian 
mechanics. For the first time, accurate, quantitative predictions with accompanying 
plausible explanations were possible, and these unified much of our basic 
understanding of the physical properties of matter. 

The success of early reductionist models in physics, and later those in chemistry, 
led in 1847 to a program, proposed by Ludwig, Helmholtz, Brücke, and du Bois-Reymond, 
to analyze (biological) processes, such as urine secretion or nerve conduction, in 
physico-chemical terms (Bynum et al., 1981). However, although reductionism 
has been successful in large part in the development of physics and chemistry, 
and to a great extent in acquiring the parts list for modern biology—consider the 
gene—the properties of many systems resist a reductionist explanation (Solé and 
Goodwin, 2000). This ultimate failure of reductionism in biology, as in other dis- 
ciplines, is due to a number of factors, principal among them being the fact that 
biological systems are inherently complex. 

Although complexity is a phenomenon about which little agreement has been 
reached, and certainly for which no all-encompassing measure has been established, 
the concept is understood to pertain to systems of interacting parts. Having many 
parts is not necessary: it is sufficient that they are coupled in some way, so that the 
state of one of them affects the state of one or more others. Often the interactions are 
nonlinear so, unlike systems which can be modeled by considering averaged effects, 
it is not possible to reduce the system’s behavior to the sum of its parts (Davey 
and Kell, 1996). Common interactions in these systems are feedback loops, in which, 
as the name suggests, information from the output of a system transformation is 
sent back to the input of the system. If the new input facilitates and accelerates 
the transformation in the same direction as the preceding output, the loop is a 
positive feedback: its effects are cumulative. If the new input produces an output 
in the opposite direction to previous outputs, the loop is a negative feedback: its 
effects stabilize the system. In the first case there is exponential growth or 
decline; in the second there is maintenance of the equilibrium. These loops have 
been studied in 
a variety of fields, including control engineering, cybernetics, and economics. An 
understanding of them and their effects is central to building and understanding 
models of complex systems (Kell, 2004, 2005; Milo et al., 2002). 
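
The two cases can be illustrated with the simplest possible linear feedback laws. This is only a minimal sketch and is not taken from the text: the state variable x, the rate constant k > 0, the initial value x_0, and the set point x^* are all assumed symbols introduced for illustration.

```latex
\begin{align}
\text{positive feedback:}\quad \frac{dx}{dt} &= k\,x,
  & x(t) &= x_0\,e^{kt},\\
\text{negative feedback:}\quad \frac{dx}{dt} &= -k\,(x - x^{*}),
  & x(t) &= x^{*} + (x_0 - x^{*})\,e^{-kt}.
\end{align}
```

In the first law any deviation from zero is amplified exponentially, upward or downward according to the sign of x_0; in the second, any deviation from x^* decays exponentially, so the equilibrium is maintained.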

Negative feedback loops are typically responsible for regulation, and they are 
obviously central to homeostasis in biological systems. In control engineering, 
such systems are conveniently described using Laplace transforms—a means of 
simplifying the combination and manipulation of ordinary differential equations 
(ODEs), and closely related to the Fourier transform (Ogata, 2001); Laplace 
transforms for a large variety of different standard feedback loops are known and 
well-understood, though analysis and understanding of non-linear feedback remains 
difficult (see chapter 12 for details). Classical negative feedback loops are considered 
to provide stability (as indeed they do in simple systems in which the feedback 
is fast and effective), though we note that negative feedback systems incorporating 
delays can generate oscillations (for example, Nelson et al., 2004). 
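
The effect of a delay on negative feedback can be sketched numerically. The example below is an assumption-laden toy, not a model from the text: it integrates the linear delayed equation dx/dt = -k x(t - tau) with a simple forward Euler scheme, and the function names, the constant history x = 1 for t <= 0, and the parameter values are all hypothetical choices made for illustration.

```python
def simulate_delayed_feedback(k, tau, dt=0.01, t_end=20.0):
    """Forward Euler integration of dx/dt = -k * x(t - tau).

    Assumes the history x(t) = 1.0 for all t <= 0. Returns the list of
    x values at times 0, dt, 2*dt, ..., t_end.
    """
    delay_steps = int(round(tau / dt))
    n_steps = int(round(t_end / dt))
    x = [1.0]
    for i in range(n_steps):
        # Before the delay has elapsed, the feedback sees the constant history.
        delayed = x[i - delay_steps] if i >= delay_steps else 1.0
        x.append(x[i] - dt * k * delayed)
    return x


def count_sign_changes(xs):
    """Count how often consecutive samples have strictly opposite signs."""
    return sum(1 for a, b in zip(xs, xs[1:]) if a * b < 0)


if __name__ == "__main__":
    instant = simulate_delayed_feedback(k=1.0, tau=0.0)
    delayed = simulate_delayed_feedback(k=1.0, tau=1.0)
    print("sign changes without delay:", count_sign_changes(instant))
    print("sign changes with delay:   ", count_sign_changes(delayed))
```

With no delay the trajectory decays monotonically toward zero and never changes sign. With the delay the feedback acts on stale information, so the variable overshoots the equilibrium and oscillates around it.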

Positive feedback is a rather less appreciated concept for most people and, until 
recently, it could be all but passed over in even a control engineer’s education. This 
is perhaps because it is often equated with undesired instability in a system, so it 
is just seen as a nuisance; something which should be reduced as much as possible. 
However, positive feedback should not really be viewed in this way, particularly from 
a modeling perspective, because it is an important factor in the dynamics of many 
complex systems and does lead to very familiar behavior. One very simple model 
system of positive feedback is the Polya urn (Arthur, 1963; Barabasi and Albert, 
1999; Johnson and Kotz, 1977). In this, one begins with a large urn containing two 
balls, one red and one black. One of these is removed. It is then replaced in the 
urn, together with another ball of the same color. This process is repeated until 
the urn is filled up. The system exhibits a number of important characteristics 
with respect to the distribution of the two colors of balls in the full urn: early, 
essentially random events can have a very large effect on the outcome; there is a 
lock-in effect where later in the process, it becomes increasingly unlikely that the 
path of choices will shift from one to another (notice that this is in contrast to the 
“positive feedback causes instability” view); and accidental events early on do not 
cancel each other out. The Polya urn is a model for such things as genetic drift in 
evolution, preferential attachment in explaining the growth of scale-free networks 
(Barabasi and Albert, 1999), and the phenomenon whereby one of a variety of 
competing technologies (all but) takes over in a market where there is a tendency for 
purchasers to prefer the leading technology, despite equal, or even inferior, quality 
compared with the others (for example QWERTY keyboards and Betamax versus 
VHS video). (See also Goldberg (2002) and Kauffman et al. (2000) for the adoption 
of technologies as an evolutionary process.) 
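
The urn process described above is easy to simulate. The following is a minimal sketch under stated assumptions: the function name `polya_urn_trace`, the `capacity` parameter standing for the size of the full urn, and the use of a seeded generator from Python's standard `random` module are all choices made here for illustration.

```python
import random


def polya_urn_trace(capacity, seed):
    """Simulate a Polya urn that starts with one red and one black ball.

    At each step a ball is drawn at random, returned, and joined by a new
    ball of the same color, until the urn holds `capacity` balls. Returns
    the fraction of red balls after each step, starting from 0.5.
    """
    rng = random.Random(seed)
    red, black = 1, 1
    fractions = [red / (red + black)]
    while red + black < capacity:
        # The chance of drawing red equals the current share of red balls.
        if rng.random() < red / (red + black):
            red += 1
        else:
            black += 1
        fractions.append(red / (red + black))
    return fractions


if __name__ == "__main__":
    # Different seeds stand in for different "accidental" early histories.
    for seed in range(5):
        trace = polya_urn_trace(capacity=10_000, seed=seed)
        print(f"seed {seed}: red fraction after 10 draws = {trace[10]:.3f}, "
              f"final = {trace[-1]:.3f}")
```

The lock-in effect follows from the arithmetic: with n balls in the urn, one more draw can shift the red fraction by less than 1/(n + 1), so the early draws carry most of the weight and the final composition differs from run to run.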

Positive feedback in a resource-limited environment also leads to familiar 
behavior. The fluctuations seen in stock prices, the variety of sizes of sandpiles, and 
cycles of population growth and collapse in food-chains all result from this kind of 
feedback. There is a tendency to reinforce the growth of a variable until it reaches 
a value that cannot be sustained. This leads to a crash which “corrects” the value 
again, making way for another rise. Such cyclic behavior can be predictably 
periodic but in many cases the cycle is chaotic, that is, deterministic but 
essentially unpredictable. All chaotic systems involve nonlinearity, and this is most 
frequently the result of some form of positive feedback, usually mixed with negative 
feedback (Glendinning, 1994; Tufillaro et al., 1992; Strogatz, 2000). 
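
A standard toy model of growth under a resource limit is the logistic map, x_{n+1} = r x_n (1 - x_n). It is not discussed in the text and is used here only as an illustrative sketch: the growth parameter r, the starting values, and the function name `logistic_map` are assumptions. The term r x_n reinforces growth while the factor (1 - x_n) imposes the limit.

```python
def logistic_map(r, x0, n_steps):
    """Iterate x_{n+1} = r * x_n * (1 - x_n) and return all visited values."""
    xs = [x0]
    for _ in range(n_steps):
        x = xs[-1]
        xs.append(r * x * (1.0 - x))
    return xs


if __name__ == "__main__":
    # r = 3.2: after a transient, the values alternate between two levels.
    periodic = logistic_map(r=3.2, x0=0.2, n_steps=1000)
    print("periodic tail:", [round(v, 4) for v in periodic[-4:]])

    # r = 3.9: two starts that differ by 1e-10 end up on unrelated paths.
    a = logistic_map(r=3.9, x0=0.2, n_steps=200)
    b = logistic_map(r=3.9, x0=0.2 + 1e-10, n_steps=200)
    gap = max(abs(p - q) for p, q in zip(a, b))
    print("largest gap between the two chaotic runs:", round(gap, 4))
```

The same deterministic rule gives a predictable rise-and-crash cycle for one parameter value and, for another, trajectories in which a tiny difference in the starting value grows until the two runs no longer resemble each other.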

Behavior involving oscillatory patterns may also be important in biological 
signaling (Lahav et al., 2004; Nelson et al., 2004), where the downstream detection 
may be in the frequency rather than the amplitude (that is, simply concentration) 
domain (Kell, 2005). All of this said, despite encouraging progress (for example 
(Tyson et al., 2003; Wolf and Arkin, 2003; Yeger-Lotem et al., 2004)), we are far 
from having a full understanding of the behavior of concatenations of these simple 
motifs and loops. Thus, the Elowitz and Leibler oscillator (Elowitz and Leibler, 
2000) is based solely on negative feedback loops but is unstable. However, this 
system could be made comparatively stable and robust by incorporating positive 
feedback loops, which led to some interesting work by Ferrell on the cell cycle 
(Angeli et al., 2004; Pomerening et al., 2003). 

It is now believed that most systems involving interacting elements have both 
chaotic and stable regions or phases, with islands of chaos existing within stable 
regions, and vice versa (for a biological example, see (Davey and Kell, 1996)). 
Chaotic behavior has now been observed even in the archetypal, clockwork system 
of planetary motion, whereas the eye at the heart of a storm is an example of 
stability occurring within a wildly unpredictable whole. 

Closely related to the vocabulary of complexity and of chaos theory is the slippery 
new (or not so new?) concept of emergence (Davies, 2004; Holland, 1998; Johnson, 
2001; Kauffman, 2000; Morowitz, 2002). Emergence is generally taken to mean 
simply that the whole is more than (and maybe qualitatively different from) the 
sum of its parts, or that system-level characteristics are not easily derivable from the 
“local” properties of their constituents. The label of emergent phenomenon is being 
applied more and more in biological processes at many different levels, from how 
proteins can fold to how whole ecosystems evolve over time. A central question that 
the use of the term emergence forces us to consider is whether it is only a convenient 
way of saying that the behavior of the whole system is difficult to understand 
in terms of basic laws and the initial conditions of the system elements (weak 
emergence), or whether, in contrast, the whole cannot be understood by the analysis 
of the parts, and current laws of physics, even in principle (strong emergence). The 
latter view would imply that high level phenomena are not reducible to physical 
laws (but may be consistent with them) (Davies, 2004). If this were true, then the 
modeling of (at least) some biological processes should not follow solely a bottom- 
up approach, hoping to go from simple laws to the desired phenomenon, but might 
eventually need us to posit high-level organizing principles and even downward 
causality. Such a worldview is completely antithetical to materialism and remains 
as yet on the fringes of scientific thought. 

In summary, reductionism has been highly successful in explaining some macro- 
scopic phenomena, purely in terms of the behavior of constituent parts. However, 
this was predicated (implicitly) on the assumption that there were few parts (for ex- 
ample, the planets) and that their interactions were simple, or that there were many 
parts but their interactions could be neglected (for example, molecules in a gas). 
However, the scope of a reductionist approach is limited because these assumptions 
are not true in many systems of interest (Kell and Welch, 1991; Solé and Good- 
win, 2000). The advent of computers and computer simulations led to the insight 
that even relatively small systems of interacting parts (such as the Lorenz model) 
could exhibit very complex (even chaotic) behavior. Although the behavior may be 
deterministic, complex systems are hard to analyze using traditional mathematical 
and analytical methods. Prediction, control, and understanding arise mainly from 
modeling these systems using iterated computer simulations. Biological systems, 
which are inherently complex, must be modeled and studied in this way if we are 
to continue to make strides in our understanding of these phenomena. 
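The point about small deterministic systems can be illustrated with a minimal sketch of the Lorenz model. The parameter values (sigma = 10, rho = 28, beta = 8/3), the starting point, the step size, and the fourth-order Runge-Kutta integrator are all assumptions chosen for illustration and are not taken from the text. Two trajectories that start a distance of 1e-8 apart are integrated side by side and their separation is recorded.

```python
# Lorenz system with commonly used (assumed) parameter values.
def lorenz(state, sigma=10.0, rho=28.0, beta=8.0 / 3.0):
    x, y, z = state
    return (sigma * (y - x), x * (rho - z) - y, x * y - beta * z)


def rk4_step(f, state, dt):
    """Advance the state by one classical fourth-order Runge-Kutta step."""
    k1 = f(state)
    k2 = f(tuple(s + 0.5 * dt * k for s, k in zip(state, k1)))
    k3 = f(tuple(s + 0.5 * dt * k for s, k in zip(state, k2)))
    k4 = f(tuple(s + dt * k for s, k in zip(state, k3)))
    return tuple(
        s + dt / 6.0 * (a + 2.0 * b + 2.0 * c + d)
        for s, a, b, c, d in zip(state, k1, k2, k3, k4)
    )


def separation_over_time(steps=4000, dt=0.01, eps=1e-8):
    """Integrate two nearby initial conditions and return their distance per step."""
    a = (1.0, 1.0, 1.0)
    b = (1.0 + eps, 1.0, 1.0)  # tiny perturbation of the first coordinate
    separations = []
    for _ in range(steps):
        a = rk4_step(lorenz, a, dt)
        b = rk4_step(lorenz, b, dt)
        separations.append(sum((p - q) ** 2 for p, q in zip(a, b)) ** 0.5)
    return separations


if __name__ == "__main__":
    seps = separation_over_time()
    for i in (0, 999, 1999, 3999):
        print(f"t = {(i + 1) * 0.01:6.2f}  separation = {seps[i]:.3e}")
```

Although every step is fully deterministic, the separation grows by many orders of magnitude before saturating at the size of the attractor, which is why long-term prediction of such a system rests on iterated simulation rather than on closed-form analysis.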





1.3 The Purposes and Implications of Modeling 


We take it as essentially axiomatic that the purposes of academic biological research 
are to allow us to understand more than we presently do about the behavior and 
workings of biological systems (see also Klipp et al. (2005)) (and in due time to 
exploit that knowledge for agricultural, medical, commercial, or other purposes). 
We consider that there are several main reasons why one would wish to make models 
of biological systems and processes, and we consider each in turn. In summary, they 
can all be characterized as variations of simulation and prediction. By simulation 
we mean the production of a mathematical or computational model of a system or 
subsystem that seeks to represent or reproduce some properties that that system 
displays. Although often portrayed as substantially different (though we consider 
that it is not), prediction involves the production of a similar type of mathematical 
model that simulates (and then predicts) the behavior of a system related to the 
starting system described above. Clearly simulation and prediction are thus related 
to each other, and the important concept of generalization describes the ability of 
a model derived for one purpose to predict the properties of a related system under 
a separate set of conditions. Thus some of the broad reasons—indeed probably the 
main reasons—why one would wish to model a (biological) system include: 


• Testing whether the model is accurate, in the sense that it reflects (or can be 
made to reflect) known experimental facts 


• Analyzing the model to understand which parts of the system contribute most to 
some desired properties of interest 


• Hypothesis generation and testing, allowing one rapidly to analyze the effects 
of manipulating experimental conditions in the model without having to perform 
complex and costly experiments (or to restrict the number that are performed) 


• Testing what changes in the model would improve the consistency of its behavior 
with experimental observations 


Our view of the basic bottom-up systems biology agenda is given in fig. 1.3. 


1.3.1 Testing Whether the Model Is Accurate 


A significant milestone in a modeling program is the successful representation of the 
behavior of the “real” system by a model. This does not, of course, mean that the 
model is accurate, but it does mean that it might be. Thus the dynamical behavior 
of variables such as concentrations and fluxes is governed by the parameters of the 
systems such as the equations describing the local properties and the parameters 
of those equations. This of itself is not sufficient, since generalized equations (for 
example, power laws, polynomials, perceptrons with nonlinear properties) with 
no mechanistic or biological meaning can sometimes reproduce well the kinetic 
behavior of complex systems without giving the desired insight into the true 
constitution of the system. 

Such models may also be used when one has no experimental data, with a 
view to establishing whether a particular design is sensible or whether a particular 
experiment is worth doing. In the former case, of engineering design, it is nowadays 
commonplace to design complex structures such as electronic circuits and chips, 
buildings, cars, or aeroplanes entirely inside a computer before committing them 
to reality. Famously, the Boeing 777 was designed entirely in silico before being 
tested first in a wind tunnel and then with a human pilot. It is especially this kind 
of attitude and experience in the various fields of engineering that differs from the 
current status of work in biology that is leading many to wish to bring numerical 
modeling into the biological mainstream. Another example is the development of 
“virtual” screening, in which the ability of drugs to bind to proteins is tested in silico 
using structural models and appropriate force fields to calculate the free energy of 
binding to the target protein of interest of ligands in different conformations (Böhm 
and Schneider, 2000; Klebe, 2000; Langer and Hoffmann, 2001; Shen et al., 2003; 
Zanders et al., 2002), the most promising of which may then be synthesized and 
tested. The attraction, of course, is the enormous speed and favorable economics 
(and scalability) of the virtual over the actual “wet” screen. 


1.3.2 Analyzing Subsystem Contributions 


Having a model allows one to analyze it in a variety of ways, but a chief one is 
to establish those parts of the model that are most important for determining the 
behavior in which one is particularly interested. This is because simple inspection 
of models with complex (or even simple) feedback loops just does not allow one 
to understand them (Westerhoff and Kell, 1987). Techniques such as sensitivity 
analysis (see below) are designed for this, and thus indicate to the experimenter 
which parameters must be known with the highest precision and should be the focus 
of experimental endeavor. This is often the focus of so-called top-down analyses 
in which we seek to analyze systems in comparatively general or high-level terms, 
lumping together subsystems in order to make the systems easier to understand. The 
equivalent in pharmacophore screening is the QSAR (quantitative structure-activity 
relationship) type of analysis, from which one seeks to analyze those features of a 
candidate binding molecule that best account for successful binding, with a view 
to developing yet more selective binding agents. 


[Figure 1.3, text of the panels] 

Bottom-up Systems Biology Pipeline (Dry) 
1. Qualitative (structural) model (who talks to whom as substrate, product, or effector) → 
   Quantitative model including “real” or approximate equations describing individual steps → 
   Parametrisation of those equations → 
   Run the model and assess its most important parameters 
Iteratively, with wet data, GOTO 1. 

Systems Biology Experiments (Including the Wet Side) 
Set up a well-defined system → 
Effect systematic perturbations (genetic, environmental, chemical) → 
Measure a time series of as many concentrations of variables, especially RNAs, 
proteins, metabolites (the ’omes) as possible → 
Model the system and compare the experimental time series to those generated by the model → 
Repeat iteratively 

Figure 1.3 The role of modeling in the basic systems biology agenda, (a) stressing the 
bottom-up element while showing the iterative and complementary top-down analyses. 
(b) The development of a model from qualitative (structural) to quantitative, and (c) its 
integration with (“wet”) experimentation. 


1.3.3 Hypothesis Generation and Testing 


Related to the above is the ability to vary, for example, parameters of the model, and 
thereby establish combinations or areas of the model’s space that show particular 
properties in which one might be interested (Pritchard and Kell, 2002), and then 
to perform that small subset of possible experiments that it is predicted will 
show such interesting behavior. An example here might be the analysis of which 
multiple modulations of enzymatic properties are best performed for the purposes 
of metabolic engineering (Cascante et al., 2002; Cornish-Bowden, 1999; Fell, 1998). 
We note also that when modeling can be applied effectively it is far cheaper than 
wet biology and, as well as its use in metabolic engineering, can reduce the reliance 
on in vivo animal/human experimentation (a factor of significant importance in the 
pharmaceutical industry). 


1.3.4 Improving Model Consistency 


In a similar vein, we may have existing experimental data with which the model 
is inconsistent, and it is desirable to explore different models to see which changes 
to them might best reproduce the experimental data. In biology this might, for 
example, allow the experimenter to test for the presence of an interaction or kinetic 
property that might be proposed. In a more general or high-level sense, we may use 
such models to seek evidence that existing hypotheses are wrong, that the model 
is inadequate, that hidden variables need to be invoked (as in the Higgs boson in 
particle physics, or the invocation of the existence of Pluto following the registration 
of anomalies in the orbit of Neptune), that existing data are inadequate, or that 
new theories are needed (such as the invention of the quantum theory to explain or 
at least get round the so-called “ultraviolet catastrophe”). In kinetic modeling this is 
often the case with “inverse problems” in which one is seeking to find a (“forward”) 
model that best explains a time series of experimental data (see below). 





1.4 Different Kinds of Models 


Most of the kinds of systems that are likely to be of interest to readers of this 
book involve entities (metabolites, signaling molecules, etc.) that can be cast as 
“nodes” interacting with each other via “edges” representing reactions that may be 
catalyzed via other substances such as enzymes. These will also typically involve 
feedback loops in which some of the nodes interact directly with the edges. We refer 
to the basic constitution of this kind of representation as a structural model (not, 
of course, to be confused with a similar term used in the bioinformatic modeling of 
protein molecular structures). A typical example of a structural model is shown in 
fig. 1.4. 


[Figure 1.4, annotation: The elements of a model always include the structural 
relationships (such as shown), the “local” equations describing the behaviour of 
each step (not shown) and the values of their parameters (not shown).] 

Figure 1.4 A structural model of a simple network involving nine enzymes (E1 to E9), 
four external metabolites (A,J,K,L—whose concentration must be assumed to be fixed 
if a steady state is to be attained), and eight internal metabolites (B,C,D,E,F,G,H,I). D 
and E are effectively cofactors and are part of a ‘moiety-conserved cycle’ (Hofmeyr et al., 
1986) in that their sum is fixed and they cannot vary their concentrations independently 
of each other. 


The classical modeling strategy in biology (and in engineering), the ordinary 
differential equation (ODE) approach (discussed in chapter 6), contains three initial 
phases, and starts with this kind of structural model, in which the reactions and 
effectors are known. The next level refers to the kinetic rate equations describing 
the “local” properties of each edge (enzyme), for instance those that relate the rate 
of the reaction catalyzed by, say, E1 to the concentrations of its substrates; a 
typical such equation (which assumes that the reaction is irreversible) is the 
Henri-Michaelis-Menten equation $v = V_{\max}[S]/([S] + K_m)$. The third level involves 
the parameterization of the model, in terms of providing values for the parameters 
(in this case $V_{\max}$ and $K_m$). Armed with such knowledge, any number of software 
packages can predict the time evolution of the variables (the concentrations and 
fluxes of the metabolites) until they may reach a steady state. This is done 
(internally) by recasting the system as a series of coupled ordinary differential 
equations which are then solved numerically. We refer to this type of operation as 
forward modeling, and provided that the structural model, equations, and values 
of the parameters are known, it is comparatively easy to produce such models 
and compare them with an experimental reality. We have been involved with 
the simulator Gepasi, written by Pedro Mendes (Mendes, 1997; Mendes and Kell, 
1998, 2001), which allows one to do all of the above, and that in addition permits 
automated variation of the parameters with which to satisfy an objective function 
such as the attainment of a particular flux in the steady state (Mendes and Kell, 
1998). 
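The three levels of forward modeling described above (structure, rate equations, parameter values) can be sketched for a deliberately tiny pathway. Everything specific here is an assumption made for illustration: a hypothetical two-step chain A → B → sink with A held fixed, irreversible Henri-Michaelis-Menten kinetics for both steps, invented parameter values, and a simple explicit Euler integrator. It is not the method used by Gepasi or by any other package.

```python
def michaelis_menten(vmax, km, s):
    """Irreversible Henri-Michaelis-Menten rate v = Vmax*[S]/([S] + Km)."""
    return vmax * s / (s + km)


def simulate(a_ext=1.0, b0=0.0, dt=0.01, t_end=20.0,
             v1max=1.0, k1=1.0, v2max=2.0, k2=0.5):
    """Forward-simulate d[B]/dt = v1([A]) - v2([B]) with explicit Euler steps.

    Structure: A -> B -> sink, with the external metabolite A held fixed.
    All parameter values are illustrative assumptions.
    """
    b, t = b0, 0.0
    trajectory = [(t, b)]
    for _ in range(int(round(t_end / dt))):
        influx = michaelis_menten(v1max, k1, a_ext)  # flux through step 1
        efflux = michaelis_menten(v2max, k2, b)      # flux through step 2
        b += dt * (influx - efflux)
        t += dt
        trajectory.append((t, b))
    return trajectory


if __name__ == "__main__":
    t_final, b_final = simulate()[-1]
    print(f"[B] at t = {t_final:.1f}: {b_final:.6f}")
    print(f"steady-state flux: {michaelis_menten(2.0, 0.5, b_final):.6f}")
```

With these assumed values the influx is 0.5, and setting influx equal to efflux gives a steady-state concentration of B of 1/6, which the simulation approaches; production tools would use adaptive, stiff-capable solvers in place of the fixed Euler step.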

In such cases, however, the experimental data that are most readily available 
do not include the parameters at all, and are simply measurements of the (time- 
dependent) variables, of which fluxes and concentrations are the most common (see 
chapter 10). Comparison of the data with the forward model is much more difficult, 
as we have to solve an inverse modeling, reverse engineering or system identification 
(Ljung, 1999b) problem (discussed in chapter 11). Direct solution of such problems 
is essentially impossible, as they are normally hugely underdetermined and do 
not have an analytical solution. The normal approach is thus an iterative one in 
which a candidate set of parameters is proposed, the system run in the forward 
direction, and on the basis of some metric of closeness to the desired output a new 
set of parameters is tested. Eventually (assuming that the structural model and 
the equations are adequate), a satisfactory set of parameters, and hence solutions, 
will be found (see table 1.1). These methods are much more computer-intensive 
than those required for simple forward modeling, as potentially many thousands or 
even millions of candidate models must be tested. Modern approaches to inverse 
modeling use approaches from heuristic optimization (Corne et al., 1999) to search 
the model space efficiently. Recent advances in multiobjective optimization (Fonseca 
and Fleming, 1996) are particularly promising in this regard, since the quality of 
a model can usually be evaluated only by considering several, often conflicting 
criteria. Evolutionary computation approaches (Deb, 2001) allow exploration of the 
Pareto front, that is the different trade-offs (for example, between model simplicity 
and accuracy) that can be achieved, enabling the modeler to make more informed 
choices about preferred solutions. 
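The iterative loop described above (propose parameters, run the forward model, score the result, propose again) can be sketched as follows. This is a minimal illustration under stated assumptions, not one of the cited optimization methods: the model is a hypothetical chain A → B → sink with Michaelis-Menten steps, the "data" are synthetic and generated from assumed true values (v2max = 2.0, k2 = 0.5), and the search is a plain accept-if-better random perturbation with a single objective.

```python
import random


def mm_rate(vmax, km, s):
    """Irreversible Michaelis-Menten rate."""
    return vmax * s / (s + km)


def forward(v2max, k2, n_points=20, steps_per_point=25, dt=0.01):
    """Forward model: sampled time series of [B] for A -> B -> sink."""
    influx = mm_rate(1.0, 1.0, 1.0)  # first step held at assumed known values
    b, series = 0.0, []
    for _ in range(n_points):
        for _ in range(steps_per_point):
            # Explicit Euler step, clamped at zero so that a poor candidate
            # cannot drive the concentration negative.
            b = max(0.0, b + dt * (influx - mm_rate(v2max, k2, b)))
        series.append(b)
    return series


def sum_squared_error(series, data):
    """Metric of closeness between a simulated series and the data."""
    return sum((x - y) ** 2 for x, y in zip(series, data))


def fit(data, start=(1.0, 1.0), iterations=1000, step=0.1, seed=0):
    """Accept-if-better random search over (v2max, k2)."""
    rng = random.Random(seed)
    best = start
    best_err = sum_squared_error(forward(*best), data)
    history = [best_err]
    for _ in range(iterations):
        # Perturb each parameter multiplicatively and keep it in [0.05, 10].
        candidate = tuple(
            min(10.0, max(0.05, p * (1.0 + rng.gauss(0.0, step)))) for p in best
        )
        err = sum_squared_error(forward(*candidate), data)
        if err < best_err:
            best, best_err = candidate, err
        history.append(best_err)
    return best, best_err, history


if __name__ == "__main__":
    synthetic_data = forward(2.0, 0.5)  # stands in for measured [B](t)
    params, err, history = fit(synthetic_data)
    print("initial error:", history[0])
    print("final error:  ", err)
    print("fitted (v2max, k2):", params)
```

Because different combinations of the two parameters can give nearly the same time series, a small final error does not guarantee that the assumed true values are recovered, which is the underdetermination noted above in miniature.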

We note, however, that there are a number of other modeling strategies and issues 
that may lead one to wish to choose different types of model from that described. 
First, the ODE model assumes that compartments are well stirred and that the 
concentrations of the participants are sufficiently great as to permit fluctuations 
to be ignored. If this is not the case then stochastic simulations (SS) are required 
(Andrews and Bray, 2004) (which are topics of chapter 8 and chapter 16). If flow of 
substances between many contiguous compartments is involved, and knowledge of 
the spatial dynamics is required (as is common in computational fluid dynamics), 
partial differential equations (PDEs) are necessary. SS and PDE models are again 
much more computationally intensive, although in the latter case the designation 
of a smaller subset of representative compartments may be effective (Mendes and 
Kell, 2001). 
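When molecule numbers are small, a stochastic simulation replaces the ODE. The sketch below uses Gillespie's direct method on a hypothetical birth-death process (constant production, first-order degradation); the reaction scheme, rate constants, and seed are assumptions chosen only to keep the example short, and they do not come from the cited work.

```python
import random


def gillespie_birth_death(k_prod=10.0, k_deg=1.0, n0=0, t_end=2000.0, seed=1):
    """Exact stochastic simulation of 0 -> X (rate k_prod), X -> 0 (rate k_deg * n)."""
    rng = random.Random(seed)
    t, n = 0.0, n0
    times, counts = [t], [n]
    while True:
        a_prod = k_prod          # propensity of production
        a_deg = k_deg * n        # propensity of degradation
        a_total = a_prod + a_deg
        t_next = t + rng.expovariate(a_total)  # exponential waiting time
        if t_next > t_end:
            break
        t = t_next
        # Choose which reaction fires, in proportion to its propensity.
        if rng.random() * a_total < a_prod:
            n += 1
        else:
            n -= 1
        times.append(t)
        counts.append(n)
    return times, counts


def time_average(times, counts, t_end, burn_in=0.0):
    """Time-weighted mean copy number over [burn_in, t_end]."""
    total = 0.0
    for i, n in enumerate(counts):
        start = max(times[i], burn_in)
        stop = times[i + 1] if i + 1 < len(times) else t_end
        if stop > start:
            total += n * (stop - start)
    return total / (t_end - burn_in)


if __name__ == "__main__":
    times, counts = gillespie_birth_death()
    print("events simulated:", len(times) - 1)
    print("time-averaged copy number:", time_average(times, counts, 2000.0, burn_in=10.0))
```

The deterministic ODE for the same scheme settles at k_prod / k_deg = 10 molecules; the stochastic trajectory fluctuates around that value, and every reaction event is simulated individually, which is the source of the extra computational cost.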

If the equations and parameters are absent, it may prove fruitful to use qualitative 
models (Hunt et al., 1993), in which only the direction of change (and maybe rate 
of change) is recorded, in an attempt to constrain the otherwise huge search space 
of possible structural models (see chapter 7). Similarly, models may invoke discrete 
or continuous time, they may be macro or micro, and they may be at a single level 
(such as metabolism, signaling) or at multiple levels (in which the concentrations 
of metabolites affect gene expression and vice versa (ter Kuile and Westerhoff, 
2001)). Models may be top-down (involving large “blocks”) or bottom-up (based on 
elementary reactions), and analyses beneficially use both strategies (fig. 1.3). Thus a 
“middle-out” strategy is preferred by some authors (Noble, 2003a) (see chapter 14). 
Table 1.2 sets out some of the issues in terms of choices which the modeler may 
face in deciding which type of model may be best for particular purposes and on 
the basis of the available amount of knowledge of the system. 


Table 1.1 10 Steps in (Inverse) Modeling. 

1. Get acquainted with the target system to be modeled 
2. Identify important variable(s) that changes over time 
3. Identify other key variables and their interconnections 
4. Decide what to measure and collect data 
5. Decide on the form of model and its architecture 
6. Construct a model by specifying all parameters. Run the model forward and measure behavior. 
7. Compare model with measurements. If model is improving return to 6. If model is not 
   improving and not satisfactory, return to 3, 4, and 5. 
8. Perform sensitivity analysis. Return to 6 and 7 if necessary. 
9. Test the impact of control policies, initial conditions, etc. 
10. Use multicriteria decision-making (MCDM) to analyze policy trade-offs. 


Table 1.2: Different types of model, presented as choices facing the 
experimenter when deciding which strategy or strategies may be 
most appropriate for a given problem. 

| Dimension or feature | Possible choices | Comments |
|---|---|---|
| Stochastic or deterministic | Stochastic: Monte Carlo methods or statistical distributions. Deterministic: equations such as ODEs. | Phenomena are not of themselves either stochastic or deterministic; large-scale, linear systems can be modeled deterministically, while a stochastic model is often more appropriate when nonlinearity is present. |
| Discrete versus continuous (in time) | Discrete: Discrete event simulation, for example, Markov chains, cellular automata, Boolean networks. Continuous: Rate equations. | Discrete time is favored when variables only change when specific events occur (modeling queues). Continuous time is favored when variables are in constant flux. |
| Macroscopic versus microscopic | Microscopic: Model individual particles in a system and compute averaged effects as necessary. Macroscopic: Model averaged effects themselves, for example, concentrations, temperatures, etc. | Are the individual particles or subsystems important to the evolution of the system, or is it enough to approximate them by statistical moments or ensemble averages? |
| Hierarchical versus multi-level | Hierarchical: Fully modular networks. Multi-level: Loosely connected components. | Can some processes/variables in the system be hidden inside modules or objects that interact with other modules, or do all the variables interact, potentially? This relates to reductionism versus holism. |
| Fully quantitative versus partially quantitative versus qualitative | Qualitative: Direction of change modeled only, or on/off states (Boolean network). Partially quantitative: Fuzzy models. Fully quantitative: ODEs, PDEs, microscopic particle models. | Reducing the quantitative accuracy of the model can reduce complexity greatly and many phenomena may still be modeled adequately. |
| Predictive versus exploratory/explanatory | Predictive: Specify every variable that could affect outcome. Exploratory: Only consider some variables of interest. | If a model is being used for precise prediction or forecasting of a future event, all variables need to be considered. The exploratory approach can be less precise but should be more flexible, for example, allowing different control policies to be tested. |
| Estimating rare events versus typical behavior | Rare events: Use importance sampling. Typical behavior: Importance sampling not needed. | Estimation of rare events, such as apoptosis times in cells, is time-consuming if standard Monte Carlo simulation is used. Importance sampling can be used to speed up the simulation. |
| Lumped or spatially segregated | Lumped: Treat cells or other components/compartments as spatially homogeneous. Spatially segregated: Treat the components as differentiated or spatially heterogeneous. | If heterogeneous it may be necessary to use the computationally intensive partial differential equation, though other solutions are possible (Mendes and Kell, 2001). |
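The remark in the table about rare events and importance sampling can be made concrete with a toy problem that stands in for a biological one. The example below is an assumption made purely for illustration: it estimates the probability that a standard normal variable exceeds 4 (about 3.2e-5), first by plain Monte Carlo and then by sampling from a proposal distribution shifted to the rare region and reweighting each sample by the likelihood ratio.

```python
import math
import random


def naive_estimate(threshold=4.0, n=20000, seed=0):
    """Plain Monte Carlo: fraction of standard normal draws above the threshold."""
    rng = random.Random(seed)
    hits = sum(1 for _ in range(n) if rng.gauss(0.0, 1.0) > threshold)
    return hits / n


def importance_estimate(threshold=4.0, n=20000, seed=0):
    """Importance sampling with a unit-variance normal proposal centred on the threshold."""
    rng = random.Random(seed)
    shift = threshold
    total = 0.0
    for _ in range(n):
        x = rng.gauss(shift, 1.0)
        if x > threshold:
            # Likelihood ratio of N(0, 1) to N(shift, 1) evaluated at x.
            total += math.exp(-shift * x + 0.5 * shift * shift)
    return total / n


def exact_tail(threshold=4.0):
    """Exact upper-tail probability of a standard normal, for comparison."""
    return 0.5 * math.erfc(threshold / math.sqrt(2.0))


if __name__ == "__main__":
    print("exact:              ", exact_tail())
    print("naive Monte Carlo:  ", naive_estimate())
    print("importance sampling:", importance_estimate())
```

With 20,000 draws the plain estimator expects fewer than one hit and so is usually zero or badly off, whereas roughly half of the proposal draws land in the rare region and contribute a weighted amount to the estimate.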




















1.5 Sensitivity Analysis 


- Sensitivity analysis for modelers? 
- Would you go to an orthopaedist who didn’t use X-ray? 
Jean-Marie Furbringer 


Sensitivity analysis (Saltelli et al., 2000) represents a cornerstone in our analysis of 
complex systems. It asks the generalized question “what is the effect of changing 
something (a parameter P) in the model on the behavior of some variable element 
M of the model?” To avoid the magnitude of the answer depending on the units used 
we use fractional changes $\Delta P$ and observe their effects via fractional changes ($\Delta M$) 
in M. Thus the generalized sensitivity is $(\Delta M/M)/(\Delta P/P)$ and in the limit of small 
changes (where the sensitivity is then independent of the size of $\Delta P$) the sensitivity 
is $(dM/M)/(dP/P) = d(\ln M)/d(\ln P)$. The sensitivities are thus conceptually and 
numerically the same as the control coefficients of metabolic control analysis (MCA) 
(see Fell (1996); Heinrich and Schuster (1996); and Kell and Westerhoff (1986)). 
Reasons for doing sensitivity analysis include the ability to determine: 


1. If a model resembles the system or process under study 


2. Factors that may contribute to output variability and so need the most consid- 
eration 


3. The model parameters that can be eliminated if one wishes to simplify the model 
without altering its behavior grossly 


4. The region in the space of input variables for which model variation is maximum 
5. The optimal region for use in a calibration study 


6. If and which groups of factors interact with each other. 
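As a worked illustration of the scaled sensitivity defined at the start of this section, consider the irreversible Henri-Michaelis-Menten rate law of section 1.4, taking the rate v as the variable M. The choice of this rate law as the example is ours; the derivation follows directly from taking logarithms.

```latex
\begin{align*}
  v &= \frac{V_{\max}[S]}{[S] + K_m},
  \qquad
  \ln v = \ln V_{\max} + \ln [S] - \ln\bigl([S] + K_m\bigr), \\[4pt]
  \frac{\partial \ln v}{\partial \ln V_{\max}} &= 1, \\[4pt]
  \frac{\partial \ln v}{\partial \ln K_m}
    &= -\frac{K_m}{[S] + K_m}, \\[4pt]
  \frac{\partial \ln v}{\partial \ln [S]}
    &= 1 - \frac{[S]}{[S] + K_m} = \frac{K_m}{[S] + K_m}.
\end{align*}
```

At $[S] = K_m$, for example, a 1% increase in $K_m$ lowers the rate by about 0.5%, while a 1% increase in $V_{\max}$ raises it by 1% at any substrate concentration. All three sensitivities are dimensionless, as the use of fractional changes intends.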


A basic prescription for performing sensitivity analysis (adapted from (Saltelli 
et al., 2000)) is: 


1. Identify the purpose of the model and determine which variables should concern 
the analysis. 


2. Assign ranges of variation to each input variable. 
3. Generate an input vector matrix through an appropriate design (DoE). 
4. Evaluate the model, thus creating an output distribution or response. 


5. Assess the influence of each variable or group of variables using correla- 
tion/regression, Bayesian inference (chapter 4), machine learning, or other methods. 
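The prescription above can be sketched for a very small model. The following is an illustration under stated assumptions rather than the procedure of Saltelli et al.: the model is the steady-state concentration of B in a hypothetical chain A → B → sink with Michaelis-Menten steps, the nominal parameter values are invented, the design is simple uniform random sampling within ±10% of nominal, and influence is assessed both by local scaled sensitivities (central differences in log space) and by correlation between log inputs and log output.

```python
import math
import random

NOMINAL = {"v1max": 1.0, "k1": 1.0, "v2max": 2.0, "k2": 0.5}  # assumed values
A_EXT = 1.0  # fixed external substrate concentration (assumed)


def steady_state_b(p):
    """Steady-state [B] for A -> B -> sink with Michaelis-Menten steps."""
    influx = p["v1max"] * A_EXT / (A_EXT + p["k1"])
    return p["k2"] * influx / (p["v2max"] - influx)


def local_sensitivity(model, params, name, h=1e-4):
    """Central-difference estimate of d(ln M)/d(ln P) at the given point."""
    up, down = dict(params), dict(params)
    up[name] = params[name] * math.exp(h)
    down[name] = params[name] * math.exp(-h)
    return (math.log(model(up)) - math.log(model(down))) / (2.0 * h)


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)


def sampled_correlations(model, params, spread=0.1, n=2000, seed=0):
    """Steps 2 to 5: assign ranges, sample a design, evaluate, correlate."""
    rng = random.Random(seed)
    names = sorted(params)
    inputs = {name: [] for name in names}
    outputs = []
    for _ in range(n):
        sample = {
            name: params[name] * rng.uniform(1.0 - spread, 1.0 + spread)
            for name in names
        }
        for name in names:
            inputs[name].append(math.log(sample[name]))
        outputs.append(math.log(model(sample)))
    return {name: pearson(inputs[name], outputs) for name in names}


if __name__ == "__main__":
    for name in sorted(NOMINAL):
        s = local_sensitivity(steady_state_b, NOMINAL, name)
        print(f"{name:6s} d ln[B] / d ln P = {s:+.4f}")
    print(sampled_correlations(steady_state_b, NOMINAL))
```

For this toy model the local sensitivities can be checked by hand (+1 for k2, -4/3 for v2max, +4/3 for v1max, -2/3 for k1), and the sampled correlations rank the parameters in the same order of influence. A real study would normally use a designed sampling scheme and a richer assessment method than simple correlation.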


Two examples from our recent work illustrate some of these issues. In the first, 
(Nelson et al., 2004; Ihekwaba et al., 2004), we studied a refined version of a model 
(Hoffmann et al., 2002) of the NF-κB pathway. This contained 64 reactions with 
their attendant parameters, but sensitivity analysis showed that only 8-9 of them 
exerted significant influence on the dynamics of the nuclear concentration of NF-κB 
in this system, and that each of these reactions involved free IκBα and free 
IKK. An entirely different study (White and Kell, 2004) asked whether comparative 
genomics and experimental data could be used to rank candidate gene products in 
terms of their utility as antimicrobial drug targets. The contribution of each of 
the submetrics (such as essentiality, or existence only in pathogens and not hosts 
or commensals) to the overall metric was analyzed by sensitivity analysis using 3 
different weighting functions, with the top 3 targets (which were quite different 
from those of traditional antibiotics) being similar in all cases. This gave much 
confidence in the robustness of the conclusions drawn. 


1.6 Concluding Remarks 


The purpose of this chapter was to give an overview of some of the reasons for 
seeking to model complex cellular biological systems, and this we trust that we have 
done. We have also given a very brief overview of some of the methods, but we have 
not dwelt in detail on: their differences, the question of which modeling strategies to 
exploit in particular cases, the problems of underdetermination (where many models 
can fit the same data) and of model choice (which model one might then prefer and 
why), nor on available models (for example, at http://www.biomodels.net/) and 
model exchange using, for example, the systems biology markup language (SBML) 
(http://www.sbml.org) (Finney and Hucka, 2003; Hucka et al., 2003; Shapiro et al., 
2004) or others (Lloyd et al., 2004). These issues are all covered well in the other 
chapters of this book. 

Finally, we note here that despite the many positive advantages of the modeling 
approach, biologists are generally less comfortable with, and confident in, models 
(and even theories) than are practitioners in some other fields where this is more 
of a core activity, such as physics or engineering. Indeed, when Einstein was once 
informed that an experimental result disagreed with his theory of relativity, he 
famously and correctly remarked “Well, then, the experiment is wrong!” It is our 
hope that trust will grow, not only from a growing number of successful modeling 
endeavors, but also from a greater and clearer communication of models enabled 
by new technologies such as Web services and the SBML. 


2 Complexity and Robustness of 
Cellular Systems 


Jörg Stelling, Uwe Sauer, Francis J. Doyle III, and John Doyle 


The daunting complexity of cellular systems appears as a major hurdle for large- 
scale modeling efforts. This complexity resides not only in the sheer number of 
components and interactions, but also in the operations on multiple levels and 
time-scales. Guidelines for meaningful modeling such as underlying organizational 
and design principles are thus required. A key to derive guidelines could be the high 
internal organization and the selection for function that distinguish cellular systems 
from complex physical systems; both factors considerably shrink the space of 
possible designs. One prominent aspect of cellular functions is their robustness, that 
is, their insensitivity to a wide range of perturbations. Here, we focus on connections 
between cellular complexity and robustness—with robustness requirements being 
the driving forces for complexity. Since only a rather limited set of mechanisms 
establishes robustness in biological circuits, understanding robustness can provide a 
key for understanding cellular organization. Practical implications for the modeling 
task are, for instance, the emphasis on network structures over exact values of 
kinetic parameters. Thus, we advocate that qualitative or structural modeling 
approaches may already yield deep insights by identifying important versus less 
important parts of a system for the purpose of more detailed modeling. 





2.1 Introduction 


Complexity is a hallmark of cellular systems, with great challenges for the devel- 
opment and analysis of cellular networks at the system level. Without appropriate 
conceptual frameworks for dealing with that complexity, the vision of ultimately 
going from the description of entire cells to organs and organisms will not be achiev- 
able. Hence, it is important to think about rather high-level abstractions of cellular 
properties that could help in system modeling and analysis. In general, complex systems 
may either show a behavior or a design that is difficult to understand (Weng 
et al., 1999). While the behavior of biological systems is, in most cases, relatively 
simple, the number of metabolic and regulatory genes shows that complexity in 
biology arises mainly from abundant control circuits, that is, from the system’s 
design. 

For maintaining simple behavior under real-life conditions, biological systems 
have to cope with a constantly varying environment, be it changing physico- 
chemical conditions or noisy external signals that have to be processed. Moreover, 
their internal properties are also subject to uncertainty, since they can, for instance, 
be changed by mutations, and because stochastic noise is an important source of 
cellular variability. Therefore, evolution must have strongly favored robustness, that 
is, a system’s ability to maintain (key) functional characteristics despite potentially 
harmful external or internal perturbations. A now widely accepted notion is that 
many (or most) cellular sub-systems are robust (Kitano, 2002a; Stelling et al., 
2004b; Kitano, 2004b). Examples for this capacity can already be found in simple 
organisms such as the bacterium Escherichia coli, which displays robust perfect 
adaptation in its search for nutrients (see chapter 12) and also a high resistance to 
gene deletions (see section 2.4). 

Robustness has long been recognized as an important property of biological sys- 
tems, for instance described as “canalization” (towards a specific outcome despite 
uncertain starting conditions) in developmental biology. However, the understand- 
ing of how robustness is accomplished at the cellular or molecular level is still limited 
(Hartman et al., 2001), mainly because robustness is intimately linked to the ap- 
parent complexity of cellular systems. For instance, the main purpose of cellular 
control systems seems to be to guarantee reliable performance of vital functions un- 
der conditions of uncertainty (Lauffenburger, 2000; Csete and Doyle, 2002). Hence, 
elucidating high-level cellular design principles that could be exploited in systems 
modeling will require the simultaneous consideration of complexity and robustness 
in cellular networks—which is the topic of the present chapter. 

We will start by describing the sources and types of cellular complexity in 
more biological detail, before attempting to distinguish the types of complexity 
present in biological and physical systems by focusing on the functional and 
organizational principles that underlie this complexity at a more abstract level (section 2.2). 
Robustness as a concept for understanding biological function and behavior will 
require a more in-depth exposition of the theoretical concept (section 2.3), before 
we discuss two biological example systems, namely central metabolism and circa- 
dian clocks (section 2.4). These examples are intended to explain how and why 
robustness can help in modeling cellular complexity (section 2.5). 

2.2 Complexity of Cellular Networks 

2.2.1 Sources of Complexity 


Biological complexity arises at several levels. At the molecular level, heterogeneous 
regulation networks control individual cell responses to environmental changes. The 
basic biological information flow from DNA to biochemical activities—with inter- 
connected control mechanisms—is illustrated here for metabolic networks (figs. 2.1 
and 2.2). About a quarter of the roughly 4,000 genes in a typical microbe encode the 
enzymes that catalyze approximately 1,000 biochemical reactions. While all cells 
share essentially the same DNA, the rate of transcription (synthesis of mRNA from 
DNA) varies greatly for each gene. Dynamically controlled by overlapping networks 
of repressors and activators, transcription is further affected by the hard-wired loca- 
tion of the gene in an operon (or on the genome), promoter or initiation site quality, 
or more general mechanisms like DNA topology and epigenesis. Typically, regula- 
tory proteins themselves are subject to negative and positive feedback regulation 
through interaction with other proteins or metabolites. Next, mRNA is translated 
into protein, which again is regulated at multiple levels by different mechanisms 
that include mRNA stability, active degradation, attenuation (premature termina- 
tion as a function of the initial rate of translation), rare tRNAs, anti-sense RNA, 
quality of the ribosome binding site, etc. 

Figure 2.1 Complexity in cellular networks. Flow of information (left) and example 
interaction network (right). Cellular components are, for example, regulatory proteins 
(ellipses, R), enzymes (ellipses, E), and metabolites (capital letters). Bold arrows indicate 
regulatory influences (activation or inhibition), while normal arrows denote chemical 
reactions. 


Figure 2.2 Complexity in cellular networks for a typical microbe such as E. coli. 
Regulatory interactions are indicated by dashed lines. Transcript interactions are based 
on operon structures and ribosomal RNA interactions. Proteome interactions include an 
average of 6-7 protein-protein interactions as well as protein-DNA, protein-RNA, and 
protein-membrane interactions (see chapter 10 for details). Metabolic interactions include 
biochemical transformations and regulatory interactions between metabolites, RNA, and 
protein. Protein numbers encompass differences in folding, size, and covalent modifications. 
Note that not all proteins are necessarily present at the same time. 

Essentially each step of protein synthesis is affected by multiple and overlapping 
regulation loops that operate at both the global cellular level and the pathway- or 
reaction-specific level. Activity and stability of the synthesized proteins may then be 
modulated by posttranslational modification (for example, phosphorylation), aggregation 
to multimers, or complex formation with other proteins. Beyond such genetically 
determined regulation, enzyme activity is often regulated by feedback inhibition. 
This is a common regulatory principle in biosynthetic pathways, where 
endproducts inhibit the first enzyme in the pathway. In the multipurpose central 
metabolic pathways, several key enzymes are subject to feedback and feedforward 
inhibition and activation through multiple metabolites. Temporal coordination of 
control is achieved by combining rapid and sensitive regulation through feedback 
loops (seconds) with somewhat slower protein modification (seconds to minutes) 
and transcriptional/translational regulation (minutes). Almost no individual 
mechanism achieves on/off effects; most instead modulate processing rates in a 2- to 20-fold 
range. Thus, much of the complexity is based on multi-level combination of het- 
erogeneous control systems that tune strength and speed of cellular responses to 
stimuli. 

Unlike most technical systems, individual biological processes are extremely 
sensitive to the exact physico-chemical conditions because slight changes in, for 
example, temperature, pH, or the concentration and nature of the surrounding 
protein/membrane matrix influence the availability of substrates, products, and 
the kinetic properties of the enzymes themselves. Not only are the physico-chemical 
parameters rarely identical in independent experiments, but enzymes are also exposed 
to different micro-environments within a single cell that cannot be determined 
exactly. An extreme, but not the only, case is spatial separation into several 
distinct intracellular compartments—a distinguishing feature between eukaryotes 
and simpler prokaryotes. 

An additional level of complexity is the organization of different cell types into 
tissues and organs and finally of multiple tissues and organs into higher organisms 
(for instance, humans, plants). Not even in steady state cultures of single-celled 
microbes, however, are all cells necessarily in identical states. Because the control 
design is not overly stringent, subpopulations often enter a resting state or simply 
exhibit different phenotypes, which increases the chances of propagating genetic 
offspring in an ever-changing environment. On longer timescales (days to years), the enormous 
potential of biological systems for evolutionary adaptation adds yet a different level 
of complexity. Random imprecisions in copying the genetic source code during 
cell duplication continuously increase the genetic diversity within a population. 
While the overall precision of the duplication process is extraordinarily high (about 
0.003 point mutations occur per microbial genome of 2-8 million base pairs per 
round of replication), short generation times (minutes to hours) rapidly lead to 
recognizable genetic differences (Sauer, 2001). While most random differences have 
no apparent effect or are harmful, some variants bear the potential for improved 
survival upon drastic environmental changes. In contrast to most technical systems, 
biological systems thus continuously adapt by “redesigning” their makeup through 
the evolutionary process of mutation and selection. 
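
To give a feeling for these numbers, the following minimal sketch turns the quoted rate of 0.003 point mutations per genome and round of replication into two rough estimates. It assumes that mutations arise independently (Poisson statistics) and ignores selection; the function names and the population size used in the demonstration are illustrative assumptions, not values from the text.

```python
import math

# Rate quoted in the text: point mutations per genome and round of replication.
MUTATIONS_PER_GENOME_PER_REPLICATION = 0.003


def prob_lineage_mutated(generations, rate=MUTATIONS_PER_GENOME_PER_REPLICATION):
    """Probability that a single lineage has acquired at least one new point
    mutation after the given number of replications (Poisson assumption)."""
    return 1.0 - math.exp(-rate * generations)


def expected_new_mutants(population_size, rate=MUTATIONS_PER_GENOME_PER_REPLICATION):
    """Expected number of cells that acquire a new point mutation during one
    round of replication of the whole population (valid for small rates)."""
    return population_size * rate


if __name__ == "__main__":
    for generations in (1, 100, 1000):
        p = prob_lineage_mutated(generations)
        print(f"{generations:5d} replications: P(at least one mutation) = {p:.3f}")
    # Hypothetical culture size, chosen only for illustration.
    print("new mutants per round in 1e9 cells:", expected_new_mutants(1e9))
```

Under these assumptions, even a very precise copying process yields a steady supply of variants once many cells replicate over many generations, which is the point made above.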


2.2.2 “Organized” versus “Emergent” Complexity 


The staggering complexity of cellular networks makes appropriate abstractions 
mandatory for meaningful mathematical modeling. An obvious pragmatic approach 
consists of decomposing the networks into smaller units that allow for the devel- 
opment of models of limited complexity. Likewise, models for cellular networks are 
not built at atomic resolution of individual biochemical species. More generally, 
however, with an ultimate goal of modeling entire cells and organs, we will need a 
deeper understanding of the specific type of complexity prevalent in biology to de- 
velop rigorous analysis methods. Here, we aim at outlining such a characterization 
by contrasting biological (and engineered) systems with complex physical systems. 

Complexity has become a field of intensive research in physics through the notion 
that systems with many components and interactions can show complicated col- 
lective (“emergent”) behavior. For instance, when adding sand to apparently stable 
sand piles, we cannot predict at which point the system reaches its “margin of sta- 
bility” and avalanches are generated. This does not mean that the behavior is not 
deterministic; we simply do not have complete knowledge of the initial conditions 
when starting such an experiment. As the system is extremely sensitive to changes 
in those conditions, the apparent behavior is chaotic. Similarly, simple sets of in- 
teracting particles can generate complicated spatial structures. Rationalizing these 
emergent properties often abstracts from the real systems by assuming homoge- 
neous components that interact randomly; analysis methods for characterizing the 
collective behavior are often rooted in statistics (Goldenfeld and Kadanoff, 1999). 
Such approaches were, for instance, used in revealing rich and complex dynamic be- 
haviors that could be generated by simplified models of cellular signaling networks 
(Amaral et al., 2004). 

A different issue is whether this type of abstraction is useful for a deeper 
understanding of biological complexity. At first glance, biological systems differ in 
several aspects from the type of physical systems mentioned above. One hallmark 
is their heterogeneity of components and interactions. They are highly structured, 
which encompasses, among other things, sophisticated spatial organization and lay- 
ering of different types of control mechanisms. Finally, their complexity resides in 
these two features as well as in the sheer numbers of components and interactions. 
From a dynamic point of view, real biological systems are rather boring in that 
homeostasis and simple switching of states prevail, while complex behavior such as 
chaos mainly occurs under conditions when the systems are not working properly. 
Hence, today’s biological systems could perhaps best be understood as rare, ex- 
tremely improbable outcomes of emergent processes leading to primitive forms of 
life, and their subsequent shaping through evolution. 

Functional requirements constitute the main differences between complex physics 
and biology/engineering. In physics, they do not exist. Biological and engineered 
systems, in contrast, are evolved or designed to fulfill functions, and are constantly 
evaluated with respect to how well they perform. In both cases, insufficient perfor- 
mance will lead to extinction of a specific species, irrespective of whether this occurs 
through evolutionary or human design processes. The immediate consequence of a 
purpose is a considerably smaller design space, in which network structures that 
could be effective and reliable implementations are likely to be rare. Hence, we will 
face a more structured (instead of randomly connected) system. A hope for under- 
standing complexity in biology then is to uncover operational principles through a 
“calculus of purpose” (Lander, 2004)—by asking teleological questions such as why 
cellular networks are organized as observed, given their known or assumed function. 


2.2.3 Function and Organization Principles 


The purpose or function as one hallmark of cellular networks itself is a rather 
complicated concept. Attributing a particular function to a subnetwork may not 
be easy because it is in many cases context dependent. For instance, a particular 
signaling pathway may have roles in counter-acting biological processes such as the 
regulation of cell proliferation and apoptosis. Owing to the multiscale organization 
in biology, we need a precise notion of function at the different scales. For the 
example above, at the organismic level the pathway may coherently serve to achieve 
homeostasis of cells in an organ. Hence, we will need a hierarchical description of 
functions and corresponding organization principles at different levels, from coarse- 
grained overall architectures to detailed insight into individual network motifs 
(Shen-Orr et al., 2002). This corresponds to the modularization of explanations 
as a final aim of dealing with biological complexity. Here, we consider the global 
architecture of metabolism as an example. In metabolism, analyses based on the 
networks’ stoichiometry alone (neglecting unknown kinetics and regulation) have 
revealed a close relation between network structure, function, and regulation at 
least for bacteria (see chapter 5 for details), which makes it suitable for high-level 
abstractions of organization principles. One possible principle has been proposed 
recently, focusing on “bow tie” structures as shown in fig. 2.3 (Csete and Doyle, 
2004). 

Figure 2.3 Bow tie abstraction of cellular organization. Open arrows denote cellular 
regulation and control. Involvement of carriers such as ATP and NAD(P)H in individual 
processes is indicated by e. 


In the bow tie view, the basic network organization is a combination of fans 
of possible inputs (such as nutrients that can be processed) and possible outputs 
(for example, the variety of biomass components) that are linked through the core 
of central metabolism. Fans and core have rather different structural properties: 
while the former show many specialized, mostly linear pathways for catabolism and 
anabolism, the highly interconnected network of central metabolism generates and 
distributes only 12 metabolites as building blocks and a few carrier molecules (such 
as ATP and NADH) that are precursors for all biosynthetic processes. The carriers, 
in addition, serve as common currencies for all (energy- or redox-dependent) cellular 
processes. Hence, standard interfaces (such as the currency metabolites) and shared 
protocols (for instance, using (A)TP for energy-dependent reactions) establish 
coherence of the network. Cellular regulation relies on a similar structure, with 
a core of general transcription/translation/degradation processes mediating the 
information flow from genetic diversity to the large numbers of proteins and their 
variants. Nesting of the two bow ties is achieved through material flux and—more 
importantly—by abundant feedback regulation. Functional advantages of such an 
organization become especially clear when comparing it to a “flat” architecture with 
individual pathways leading from every substrate to every product. Such a solution 
would be very inefficient due to the number or complexity of enzymes required. 
Coordination of pathways and buffering of fluctuations in the environment could 
be achieved only with a massive overhead of regulation connecting all the individual 
entities. In the long run, such a design would severely impede evolution, because 
evolution would have to operate on entire pathways and their associated control systems. 
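
The inefficiency of a "flat" architecture can be made concrete with a simple count. The sketch below is a deliberately crude illustration: it assumes one dedicated pathway per substrate-product pair in the flat design, and one catabolic pathway per input plus one anabolic pathway per output in the bow tie design, ignoring the reactions of the core itself. The function names and the numbers of inputs and outputs are hypothetical.

```python
def flat_pathways(n_inputs, n_outputs):
    """Pathways needed if every substrate is linked directly to every product."""
    return n_inputs * n_outputs


def bow_tie_pathways(n_inputs, n_outputs):
    """Pathways needed if every substrate feeds into a shared core and every
    product is built from that core (core reactions themselves not counted)."""
    return n_inputs + n_outputs


if __name__ == "__main__":
    # Hypothetical numbers of nutrients and biomass components.
    n_in, n_out = 50, 70
    print("flat architecture:   ", flat_pathways(n_in, n_out), "pathways")
    print("bow tie architecture:", bow_tie_pathways(n_in, n_out), "pathways")
```

In this simplified picture the flat design scales with the product of inputs and outputs, whereas the bow tie scales with their sum, so the gap widens as either fan grows.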

The bow tie organization, in contrast, can accommodate the divergent demands 
on metabolic systems. The core facilitates high throughput of metabolites with 
only a few specialized enzymes. At all timescales, ranging from the fast regulation 
of the high-flux backbone by allosteric control to the slower expression control for 
individual pathways in the periphery, the structure facilitates systems integration 
and regulation. It appears not merely that biology uses the available control 
mechanisms but that the stoichiometry itself is highly structured and organized to 
facilitate the effectiveness of these control mechanisms to create coherent and global 
responses to variations, while allowing implementation in the local mechanisms. 
The shared interfaces and protocols, finally, create “plug-and-play” features, where 
less central reactions and pathways can easily be exchanged or added. Apparently, 
bow tie architectures are associated with risks not present in the simpler type of 
networks, such as high fragility when failures in the core affect the entire system 
(Csete and Doyle, 2004). It also means low variability of the core, as documented 
by the universality of the tricarboxylic acid (TCA) cycle at the heart of catabolism 
in all living organisms (Smith and Morowitz, 2004). Hence, the structures may 
primarily allow for optimal trade-offs between a variety of requirements such as 
efficiency, robustness, and evolvability (Csete and Doyle, 2004). 

Figure 2.4 General features of bow tie structures. 


At a more abstract level, we see highly organized and structured networks 
that facilitate global and coordinated responses to variations in the environment 
on all time scales, using local and decentralized mechanisms. Fig. 2.4 illustrates 
the key features of the organization. The basic framework is employed in many 
advanced technological systems. The power grid, for instance, coordinates many 
producers and consumers with highly variable production and demand, respectively, 
by employing a common exchange protocol, namely 220 V AC. TCP/IP would be its 
equivalent for the internet. Clearly, from an engineering point of view, biology is a 
marvel of technological “design.” We argue that analogies with engineered systems 
can help us grasp biological complexity, in particular regarding how to generate 
appropriate responses to variations, which is one major requirement on all highly 
integrated systems. 





2.3 Robustness in Cellular Networks 


The notion of robustness has recently received considerable interest in diverse fields 
for which the existence of complex networks is characteristic. Examples include the 
internet, social networks, and biology (Strogatz, 2001; Stelling et al., 2004b; Kitano, 
2004a). Not surprisingly, the term robustness has been associated with different, 
sometimes conflicting interpretations. Here, starting from a broad definition we 
aim at an operational concept that proves suitable for analyzing the properties of 
cellular networks. 


2.3.1 The Concept of Robustness 


In general, robustness means the persistence of a system’s characteristic behavior 
under perturbation or conditions of uncertainty. Robustness is, hence, defined for 
a specific system, which, however, may have arbitrary structural and behavioral 
features. The concept is closely related to stability in dynamical systems theory, 
but usually employed with respect to a broader class of phenomena (Kitano, 2002b; 
Carlson and Doyle, 2002). In engineering, the task of determining a system’s 
robustness is often accomplished by transformation into a suitable stability problem. 
However, compared to stability theory in systems dynamics, no elaborate theory of 
robustness exists yet. 

It has to be noted that robustness (like stability) is a relative, 
not an absolute, property of a system. No system can maintain stability for all its 
functions when encountering any kind of perturbation. Any operational definition 
of robustness, and systems analysis thereof, thus, requires two additional specifica- 
tions. Namely, it has to be explicitly clarified, (i) which characteristic behavior or 
function remains unchanged, and (ii) for which type of disturbances or uncertain- 
ties this invariance property holds. For relatively simple systems, the characteristic 
behavior can often be captured by definition of a dynamical regime. Investigations 
of oscillators may thus focus on the persistence of a regular periodic solution (see 
section 2.4.2 for an example). Moreover, robustness is a qualitative property, and 
does not preclude quantitative changes (in period or amplitude of the oscillations) 
to occur (Barkai and Leibler, 1997). For engineered or biological systems, one often 
understands by characteristic behavior the “desired system characteristics” (Carlson 
and Doyle, 2002) to be maintained. Here, robustness directly connects to function- 
ality. In technical as well as in living systems, it makes sense to protect key func- 
tions by design, or as a result of evolution. Especially in biology, however, function 
can, in many cases, not easily be assigned to a particular subsystem of a cell or 
organism (Morohashi et al., 2002). In bacterial chemotaxis, for instance, it is 
intuitively understandable that the ability to adapt to changing nutrient concentrations 
is maintained, whereas adaptation times are allowed to fluctuate. As a counter-example, 
signal transduction relies upon sensitive detection, amplification, and decoding of 
input signals. It would not be sensible to react identically irrespective of the signals 
received. Identification of key inputs and outputs for specific sub-systems, how- 
ever, may not be evident from the complex overall network structure, and cellular 
signaling requires both robustness and precision (Freeman, 2000). The claim that 
higher-order behavior or entire modules are robust and, hence, imply a functional 
advantage therefore needs careful justification. 

Similar considerations apply for the specification of perturbations. Cellular sys- 
tems face three broad classes of uncertainties, namely (i) externally induced per- 
turbations owing to variable environments, (ii) internal perturbations arising from 
changes in the structure of the system (such as mutations affecting kinetic prop- 
erties of proteins, or leading to the lack of components), and (iii) intrinsic noise 
as a consequence of the low copy number of many cellular components. The first 
two classes of disturbance can be dealt with in a deterministic framework. External 
perturbations may directly influence the solutions of a dynamical system; resis- 
tance to these influences equals the notion of stability in dynamic systems theory. 
Perturbations that affect the structure of the system itself, but do not result 
in qualitatively different dynamics, reveal the structural stability of a system (see 
chapter 6). These two types of perturbations can, hence, be mapped on changes 
in inputs and system parameters, respectively. Stochastic effects resulting from the 
random character of biochemical reactions (see chapter 8) in principle require an 
explicit inclusion of noise in robustness analysis (Rao et al., 2002). In gene expres- 
sion, for instance, intrinsic noise considerably contributes to overall variation, with 
potential amplification and propagation by regulatory dynamics (Thattai and van 
Oudenaarden, 2001; Elowitz et al., 2002). Hence, also the theoretical methods for 
analyzing robustness have to be tailored to the specific aspects of a system under 
investigation. 


2.3.2 Mechanisms for Robustness 


Four main ingredients are currently discussed as cellular design elements for 
protection against deleterious disturbances. These encompass (i) back-up systems 
(redundancy), (ii) disturbance rejection through feedback control, (iii) structuring 
of complex systems into semi-autonomous functional units (modularity), and (iv) 
their reliable coordination via establishment of hierarchies and protocols (Csete and 
Doyle, 2002; Kitano, 2002a; Stelling et al., 2004b). We will discuss their potential 
contributions for conferring robustness to cellular networks—and for the analysis 
thereof—in this section. 

The simplest strategy to protect against failure of a specific component is to 
provide for alternative ways to carry out the function the component performs. 
However, genes that do not diverge in functionality or regulation would not survive 
during evolution (Krakauer and Plotkin, 2002). In particular, the genomics revolution, 
with comprehensive gene knockout libraries of entire organisms, has initiated 
the quest to identify mechanisms that underlie the seemingly surprising number 
of phenotypically silent deletion mutations; that is, only about 1,100 knockouts of 
the 5,700 genes are lethal in haploid S. cerevisiae. In this context, the term genetic 
robustness was coined to describe the condition in which a gene may be deleted 
without qualitatively compromising cell growth (Gu et al., 2003). The explana- 
tion is trivial for at least the approximately 1,000 genes with metabolic functions: 
about 45% of the known metabolic genes are simply not active under the inves- 
tigated condition (Papp et al., 2004; Blank et al., 2005). For the remaining 207 
viable S. cerevisiae mutants of active reactions during glucose catabolism, network 
redundancy through duplicate genes was the major (3/4) and alternative pathways 
the minor (1/4) molecular mechanism of genetic network robustness (Blank et al., 
2005). Although duplicate genes clearly contribute to the robustness of metabolic 
networks to gene deletions, the argument cannot be turned around that this is 
indeed their function because this would imply that it is a distinct mechanism. 
Quantitative analyses of the 105 duplicate gene families in S. cerevisiae clearly 
demonstrated that no particular dominant function maintains duplicate genes in 
the genome (Kuepfer et al., 2005). In particular the putative back-up function is not 
favored by evolutionary selection because duplicates do not occur more frequently 
in essential reactions than singleton genes. Hence, redundancy plays some role in 
biological robustness, but it may be largely overrated and misunderstood. 

More importantly, feedback loops can account for robustness in cellular network 
function. By using the output of a function to be controlled in order to deter- 
mine appropriate input signals, feedback enables a system to adjust the output by 
monitoring it. In general, negative feedback is employed in reducing the difference 
between actual output and a given set-point, thereby dampening noise and rejecting 
perturbations. For instance, a simple, engineered feedback loop relying on negative 
autoregulation of a transcription factor stabilized steady-state gene expression levels 
despite the inherent noise in gene expression. This autoregulation proved advanta- 
geous over unregulated transcription for a range of biologically plausible parameters 
(Becskei and Serrano, 2000). The role of positive feedback (or autocatalysis) in con- 
ferring robustness is less obvious, since it may cause instabilities. However, decisions 
for example in development need to be derived from noisy and graded input sig- 
nals and have to be maintained (see chapter 1). In one example from engineered 
gene networks (see chapter 13), two genes mutually repressing each other’s expres- 
sion (double-negative feedback) proved sufficient to construct a reliable irreversible 
switch (Gardner et al., 2000). Enhanced sensitivity through positive feedback also 
speeds up stress responses. Depending on which cellular functionalities require pro- 
tection from perturbations, both forms of feedback and combinations thereof can 
contribute to robustly achieving a desired behavior (Freeman, 2000). Therefore, in 
many cases where highly precise and reliable behavior is indispensable for overall 
cellular functionality, multiple intertwined feedback loops operate (Ferrell Jr., 2002) 
(see also section 2.4.2). True redundancy is most useful when it is part of feedback 
control systems that can sense variations and failures, and coordinate the use of 
multiple resources. Trivially, there are lots of copies of enzymes at the protein level, 
even when there is one gene, and the number is controlled. 
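
The disturbance-rejecting effect of negative feedback can be illustrated with a toy model of negative autoregulation. The sketch below is not the model of Becskei and Serrano (2000), and it is deterministic, so it addresses a parameter perturbation rather than stochastic noise. It assumes production of a protein x at rate beta / (1 + (x/K)^n) (a Hill-type repression term), first-order loss at rate gamma * x, and arbitrary illustrative parameter values. It compares how strongly the steady state shifts when the maximal production rate beta is doubled, with and without feedback.

```python
def steady_state(beta, gamma=1.0, K=1.0, n=2, regulated=True):
    """Steady-state level of x for dx/dt = production - gamma * x.

    Unregulated:  production = beta
    Regulated:    production = beta / (1 + (x / K)**n)   (negative autoregulation)
    All parameter values are illustrative assumptions.
    """
    if not regulated:
        return beta / gamma
    # The net rate decreases monotonically in x, is positive at x = 0 and
    # negative at x = beta / gamma, so bisection finds the unique root.
    lo, hi = 0.0, beta / gamma
    for _ in range(200):
        mid = 0.5 * (lo + hi)
        net_rate = beta / (1.0 + (mid / K) ** n) - gamma * mid
        if net_rate > 0.0:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)


def fold_change(beta, factor=2.0, **kwargs):
    """Ratio of steady states after multiplying beta by `factor`."""
    return steady_state(beta * factor, **kwargs) / steady_state(beta, **kwargs)


if __name__ == "__main__":
    print("unregulated fold change:", fold_change(100.0, regulated=False))
    print("autoregulated fold change:", fold_change(100.0, regulated=True))
```

Without feedback, the steady state simply follows the perturbation; with feedback, the same perturbation produces a markedly smaller shift, at the price of a lower expression level for a given beta.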

Focusing on the internal structure of cellular systems, one central, increasingly 
discussed notion is that these systems are composed of “functional units” or “mod- 
ules”. Modules can be understood as semi-autonomous entities that show dense in- 
ternal functional connections, but looser connections with their environment (Krem- 
ling et al., 2000; Girvan and Newman, 2002). With respect to robustness, modularity 
can lead to a benefit for overall functionality of complex systems. Encapsulation 
of simpler functions can reduce the risk of catastrophic failure by preventing the 
spread of damage in one module throughout the network (Hartwell et al., 1999; 
Albert et al., 2000). However, two critical issues have yet to be clarified, namely 
to prove the existence (or absence) of modularity in cellular systems, and to establish 
methods for the unambiguous identification of modules (Lauffenburger, 2000). As 
discussed in detail in chapter 3, both problems are intimately linked. 
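
One way to make "dense internal, looser external connections" quantitative is a modularity score in the style of Newman and Girvan, which compares the fraction of edges inside each proposed module with the fraction expected if edges were placed at random with the same node degrees. This particular score is an assumption introduced here for illustration and is not prescribed by the text; the toy network (two triangles joined by one edge) and all names are hypothetical.

```python
def modularity(edges, communities):
    """Modularity Q = sum over modules c of [e_c / m - (d_c / (2 m))**2],
    where m is the number of edges, e_c the number of edges inside module c,
    and d_c the summed degree of the nodes in module c.
    `edges` is a list of undirected (u, v) pairs; `communities` is a list of
    disjoint node sets that together cover all nodes."""
    m = len(edges)
    label = {node: i for i, group in enumerate(communities) for node in group}
    intra = [0] * len(communities)
    degree = [0] * len(communities)
    for u, v in edges:
        degree[label[u]] += 1
        degree[label[v]] += 1
        if label[u] == label[v]:
            intra[label[u]] += 1
    return sum(intra[c] / m - (degree[c] / (2 * m)) ** 2
               for c in range(len(communities)))


# Toy network: two triangles connected by a single edge.
EDGES = [(0, 1), (1, 2), (0, 2), (3, 4), (4, 5), (3, 5), (2, 3)]

if __name__ == "__main__":
    print("two modules:", modularity(EDGES, [{0, 1, 2}, {3, 4, 5}]))
    print("one module: ", modularity(EDGES, [{0, 1, 2, 3, 4, 5}]))
```

A partition that follows the two triangles scores higher than lumping all nodes together, but the score depends on the partition supplied, which mirrors the difficulty of identifying modules in the first place.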

Protocols encompass the set of rules aiming at an efficient management of rela- 
tionships between the parts (for example, modules) that constitute a system. They 
include, for instance, the organizational structures for embedding modules and the 
interfaces between modules that allow for system function (Csete and Doyle, 2002). 
A common protocol in biology, for instance, is “protein phosphorylation relies on 
ATP.” Using only 12 basic building blocks in metabolism is a similar convention. 
Protocols, hence, are of primary importance for an understanding of how informa- 
tion in complex systems is integrated (Hartwell et al., 1999). One efficient means 
for coordination in complex systems is to organize a system hierarchically, namely 
to establish different layers of integration (Mesarovic et al., 1970). This architec- 
ture, for instance, helps to reduce the costs of information transmission (Guimera 
et al., 2001). Several lines of evidence suggest that hierarchical structures confer 
robustness to cellular systems. One major proposition is that separation of func- 
tions, and their integration at higher levels, reduces the average damage owing to 
arbitrary perturbations of the network. Analysis of dynamical networks with overall 
structures similar to those of cellular networks demonstrated a superior systems per- 
formance and controllability when feedback control specifically operates on higher 
levels of integration (Wang and Chen, 2002). Moreover, as we argued already in sec- 
tion 2.2.3, well-designed hierarchies and protocols can contribute to robustness, for 
instance, by constraining the effects of local deregulation or by providing common 
standards for coordination of cellular functions. 


2.3.3 Robustness, Fragility, and Complexity 


With the variety of mechanisms for incorporating robustness into cellular systems 
available, it appears surprising that cells are sensitive to quantitatively minor, but 
extremely powerful perturbations such as oncogenic mutations that enable profound 
changes at a genomic scale. Two possible explanations would be that either evolution 
has yet to attain optimal robustness, or that principal limitations exist regarding 
how robust the systems can be made. The overwhelming evidence speaks for the 
latter hypothesis, which we will discuss now. 

Figure 2.5 Robustness and fragility trade-offs in feedback control. (A) Reaction scheme 
of glycolysis with activating (by ADP and F16P, gray arrow heads) and inhibiting (by ATP, 
gray bar heads) influences of co-factors and metabolites on phosphofructokinase (PFK) as 
a key glycolytic enzyme. (B) Response of the system to a sudden up-shift in ATP demand 
at different feedback gains h. (C) Relative fragility F as a function of the frequency ω. 





Consider the following example from control of glycolysis (fig. 2.5A): phospho- 
fructokinase (PFK) at the center of the pathway is a highly regulated enzyme, 
with activation by the products of the reaction it catalyzes (ADP and 
fructose-1,6-bisphosphate (F16P)), and inhibition through its co-substrate ATP. Among 
other things, 
this feedback structure allows the cell to adapt to varying ATP demands, while 
keeping the cellular ATP concentration tightly regulated. As shown in fig. 2.5B, 
the effect of a step increase in ATP demand—and thereby a sudden decrease in 
the concentration of PFK-inhibiting ATP—eventually leads to an increased flux 
through glycolysis and corresponding ATP production. The degree of recovery in 
ATP concentration apparently depends on the strength (or “gain”) of the feedback 
h. Higher feedback gain eventually reduces the steady-state deviation between ideal 
and predicted response. However, increased precision in the long run is accompanied 
by more pronounced transient responses to the perturbation. 
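The trade-off between long-run precision and transient behavior can be sketched with a deliberately simple feedback loop. The model below is an assumption made for illustration only and is not the glycolysis model behind fig. 2.5: a unit step disturbance d acts on a controlled variable y, and a feedback signal u = -h*y acts back on y through two first-order lags. The names `step_response`, `x1`, `x2`, and the chosen gains are hypothetical.

```python
def step_response(h, t_end=20.0, dt=1e-3):
    """Simulate y(t) after a unit step disturbance for feedback gain h.

    Toy loop (an assumption, not the glycolysis model): y = d + x2,
    u = -h * y, and u reaches y through two first-order lags
    x1' = -x1 + u and x2' = -x2 + x1.
    """
    x1 = x2 = 0.0
    d = 1.0                      # unit step disturbance applied at t = 0
    ys = []
    for _ in range(int(t_end / dt)):
        y = d + x2
        ys.append(y)
        u = -h * y
        # explicit Euler step; both updates use the old value of x1
        x1, x2 = x1 + dt * (-x1 + u), x2 + dt * (-x2 + x1)
    return ys


def undershoot(h):
    """How far y dips below its final value during the transient."""
    ys = step_response(h)
    return ys[-1] - min(ys)


for h in (1.0, 4.0, 10.0):
    final = step_response(h)[-1]
    print(f"h={h:5.1f}  final deviation={final:.3f}  undershoot={undershoot(h):.3f}")
```

In this sketch the final deviation shrinks as 1/(1 + h), while the transient dip below the final value grows with h, which mirrors the qualitative behavior described for fig. 2.5B.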

For a more quantitative analysis of this connection, let us employ the absolute
sensitivity for a frequency w, |S(w)|, as a measure of the deviation from perfect
control. If we define a fragility F(w) = log|S(w)|, the sign of F(w) indicates whether
perturbations will be attenuated (F(w) < 0) or amplified (F(w) > 0). Analysis
in the frequency domain shows that the effect of feedback in glycolysis is in fact
two-fold: it increases robustness at low frequencies (for example, at steady state), but
introduces fragility at higher frequencies (figure 2.5C). This is indicative of a certain
“conservation of robustness”: increased robustness somewhere will be compensated
by increased fragility elsewhere. For certain types of (linear) systems, the so-called
Bode sensitivity integral (Bode, 1945) even describes this trade-off quantitatively
as a conservation law:

$$\int_0^{\infty} F(w)\, dw = 0. \qquad (2.1)$$


Note, however, that a formal proof currently is only possible for a very limited class 
of dynamical systems. 
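Equation (2.1) can be checked numerically for one simple case. The sketch below assumes a hypothetical loop gain L(s) = h/(s + 1)^2 with sensitivity S = 1/(1 + L); this toy loop is stable and was chosen only because it belongs to the class of linear systems for which the Bode integral holds. It is not the glycolysis model, and the function names are illustrative.

```python
import math


def fragility(w, h):
    """F(w) = log|S(jw)| for the toy loop gain L(s) = h / (s + 1)**2."""
    loop = h / (1 + 1j * w) ** 2
    return -math.log(abs(1 + loop))      # S = 1 / (1 + L)


def bode_integral(h, n=20000):
    """Midpoint-rule estimate of the integral of F(w) over w in [0, inf).

    The substitution w = t / (1 - t) maps [0, inf) onto [0, 1).
    """
    total = 0.0
    for i in range(n):
        t = (i + 0.5) / n
        w = t / (1 - t)
        total += fragility(w, h) / (1 - t) ** 2
    return total / n


def peak_fragility(h):
    """Largest F(w) on a grid of frequencies between 0 and 20."""
    return max(fragility(0.01 * i, h) for i in range(2000))


for h in (1.0, 4.0, 10.0):
    print(f"h={h:5.1f}  F(0)={fragility(0.0, h):+.3f}  "
          f"peak F={peak_fragility(h):+.3f}  integral={bode_integral(h):+.1e}")
```

For this toy loop, raising h makes F(0) more negative (better attenuation at steady state) and the peak of F more positive (stronger amplification at intermediate frequencies), while the integral stays at zero up to numerical error.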

The concept of “highly optimized tolerance” (HOT) relies on the very idea
that robustness has to be regarded as a limited and conserved resource. This 
quantity (tolerance) requires careful distribution, adapted to the function a system 
is intended to perform, and the associated uncertainties. High optimization refers 
to a strategy of simultaneously achieving high performance and error-tolerance by 
a high degree of internal organization. The management and overall conservation 
of robustness lead to a “robust yet fragile” behavior of such systems, namely a high 
robustness (“barriers to cascading failures”) in the face of anticipated or usually 
encountered disturbances, but hypersensitivity to unexpected perturbations, design 
flaws, or hijacking (Carlson and Doyle, 1999, 2000). In addition, HOT emphasizes a 
necessary connection between complexity and robustness. Making certain functions 
of a system more insensitive to disturbances, for instance, may require additional 
control loops. This, in turn, leads to higher complexity and to new potential sources 
of fragility. The effect is a “spiraling complexity” in which new features expose new 
fragilities to be “fixed” by further additions to the system (Carlson and Doyle, 
2002). Hence, the distribution of robustness/fragility may be key for understanding 
system design in cell biology. 


2.3.4 From Robustness to Evolvability 


A frequent objection to the analogies we have drawn between biological and
engineered systems is that these two types of complex systems arise in
fundamentally different ways, namely through evolution versus
purpose-driven, top-down design (see, for example, Bosl and Li (2005)). Clearly,
evolvability is of paramount importance for living systems (Kirschner and Gerhart, 
1998). Here, we think of evolvability simply (maybe naively) in the sense of 
controlled and structured change in lineages, rather than cells, on long time scales in 
response to perhaps large variations in the environment. At the population level (of 
all engineered systems of one type), evidently progress in engineering fulfills similar 
criteria. More importantly, the generic mechanisms and structures responsible for
robustness do not operate at the expense of evolvability; in fact, they facilitate
both. Genetic redundancy allows duplicate genes to acquire new functions without
perturbing the cells under most conditions. Feedback control, for instance, supports 
the normal operation even during evolutionary changes. The exchange of modules 
such as biosynthetic pathways through lateral gene transfer (for instance, via 
plasmids carrying the corresponding operons) lets organisms easily gain completely 
new functions. Finally, protocols are of paramount importance for facilitating plug- 
and-play mechanisms. Protein kinases, for instance, gain new functions by changing 
substrate specificity and control of their activity, but the common currencies of ATP 
and phosphate groups as effectors remain functional. In the realm of technology, 
Lego is one of the best examples of an evolvable system on many time scales (Csete 
and Doyle, 2002); the common carrier for the power grid, which facilitates control in 
response to short term fluctuations in supply and demand, also facilitates long term 
evolvability by providing a simple protocol for suppliers and consumers. Hence, we 
could think of evolvability as robustness on longer time scales, which is also subject 
to selection during evolution (Earl and Deem, 2004). 





2.4 Biological Examples 
2.4.1 Robustness in Central Metabolism 


Although complex in their operation, metabolic networks are structurally organized 
such that a large variety of biochemical products and complex macromolecules are 
synthesized from myriads of nutrients by conversion through relatively few common 
intermediate metabolites. This so-called bow-tie architecture (section 2.2.3) results 
in a ubiquitous and interconnected core set of central reactions that constitute the
backbone of high metabolic fluxes. The central carbon metabolism, in particular, 
provides a plethora of alternative routes for generating essential precursor molecules 
and the carrier molecules for energy (ATP) and reduction equivalents (NAD(P)H). 
Here, we describe two complementary—theoretical and experimental—approaches 
for analyzing the robustness of central metabolism. 

At the theoretical level, elementary flux mode (EFM) analysis decomposes the 
metabolic network into meaningful smaller units or pathways. These EFMs can be 
defined as the smallest sub-networks enabling the metabolic system to operate in 
steady state (Schuster et al., 1999) (see chapter 5). The high number of EFMs in E. 
coli central metabolism on different substrates (see figure 2.6A,B) directly reflects 
the flexibility of this network (Stelling et al., 2002). Although all substrate regimes 
comprise the same number of reactions and metabolites, the numbers of EFMs differ by two
orders of magnitude (figure 2.6B). When considering only single-substrate regimes, 
glucose can be utilized in approximately 45 times more different ways than acetate, 
which corresponds to biological intuition. Simultaneous utilization of all substrates 
enhances the number of alternative pathways by a factor of ten. 
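The definition of an elementary flux mode can be made concrete with a small example. The five-reaction network below is hypothetical and far simpler than the E. coli model discussed in the text; the matrix `N`, the flux vectors, and the helper names are assumptions made for illustration. The check for elementarity only compares a mode against the candidates passed to it, so it is a sketch of the idea and not an EFM enumeration algorithm.

```python
# Toy network (hypothetical):
#   R1: -> A,  R2: A -> B,  R3: A -> C,  R4: C -> B,  R5: B ->
# Rows are the internal metabolites A, B, C; columns are R1..R5.
N = [
    [1, -1, -1,  0,  0],   # balance of A
    [0,  1,  0,  1, -1],   # balance of B
    [0,  0,  1, -1,  0],   # balance of C
]


def is_steady_state(v):
    """True if N v = 0, i.e. no internal metabolite accumulates."""
    return all(sum(n * f for n, f in zip(row, v)) == 0 for row in N)


def support(v):
    """Indices of the reactions that carry flux."""
    return {i for i, f in enumerate(v) if f != 0}


def is_elementary(v, others):
    """v is elementary (relative to `others`) if it is a steady state and
    none of the other modes uses a strict subset of its reactions."""
    return is_steady_state(v) and not any(support(o) < support(v) for o in others)


direct = [1, 1, 0, 0, 1]     # A -> B directly
detour = [1, 0, 1, 1, 1]     # A -> C -> B
combined = [2, 1, 1, 1, 2]   # sum of both routes: steady state, not elementary

modes = [direct, detour, combined]
for v in modes:
    rest = [o for o in modes if o is not v]
    print(v, is_steady_state(v), is_elementary(v, rest))
```

All three vectors satisfy the steady-state condition, but only the two single routes are minimal; the combined vector can be decomposed into them.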

Figure 2.6 Flexibility and robustness in central bacterial metabolism. A stoichiometric
model for E. coli with 89 substances and 110 reactions was decomposed into elementary
flux modes (EFMs) (Stelling et al., 2002). (A) Schematic network representation. Shaded
areas indicate the intracellular space. Only major nodes and twelve precursor metabolites
(bold face) (Neidhardt et al., 1990) are labeled; reactions were partially combined. (B)
Number and distribution of EFMs for different substrates. (C) Effect of arbitrary gene
deletions on viability for single (●) and for multiple (○) substrates as a function of the
total number of EFMs in wild type N(S1,...,Sn).

A plausible hypothesis concerning the connection between network flexibility and
robustness is that the degrees of freedom of a network could be used to predict
its sensitivity to disturbances. For different single-substrate uptake regimes, the
organism’s resistance to arbitrary gene deletions correlates well with the number of 
EFMs N for the corresponding wild type (figure 2.6C). Similar results are obtained 
when more than one substrate can be utilized. Here, in general, the number of viable
mutants is higher than for single-substrate regimes showing a comparable number 
of elementary modes. Most likely, this represents the effect of higher degrees of 
independence of metabolic pathways for the multi-substrate case. The ability to 
utilize different carbon sources simultaneously could, thus, be advantageous for the 
organism’s robustness. 
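The link between the number of EFMs and tolerance to deletions can be illustrated as follows. The sketch assumes that each EFM is given simply as the set of reactions it uses, and that a deletion mutant counts as viable if at least one EFM avoids the deleted reaction. The two regimes, the reaction names, and the function `viable_fraction` are hypothetical, and only reactions that appear in some EFM are deleted; this simplifies the analysis reported for figure 2.6C.

```python
def viable_fraction(efms):
    """Fraction of single-reaction deletions that leave at least one EFM intact.

    `efms` is a list of sets; each set holds the reactions used by one mode.
    """
    reactions = set().union(*efms)
    viable = sum(1 for r in reactions if any(r not in mode for mode in efms))
    return viable / len(reactions)


# Hypothetical regime with three alternative routes from uptake to export.
many_routes = [
    {"uptake", "r1", "export"},
    {"uptake", "r2", "r3", "export"},
    {"uptake", "r2", "r4", "export"},
]

# Hypothetical regime with a single route.
single_route = [{"uptake", "r5", "r3", "export"}]

print("many routes :", viable_fraction(many_routes))
print("single route:", viable_fraction(single_route))
```

In the regime with three modes, only the shared uptake and export steps are essential, so four of six deletions are tolerated; with a single mode every deletion is lethal.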

Mechanistically, robustness in central metabolism can be assessed by 13C-tracer-based
flux experiments (Sauer, 2004). The particular strength of quantitative flux
data is their high degree of integrative information on regulatory and biochemical in- 
teractions within the network. A recent systematic in vivo flux analysis investigated 
flexibility and optimal performance in central metabolism of the model prokaryote 
Bacillus subtilis by selecting a near random choice of 137 knockout mutants that 
roughly reflect the proportion of all major functional gene categories (Fischer and 
Sauer, 2005). The data revealed a remarkably robust distribution of intracellular 
carbon fluxes, as shown exemplarily for three key fluxes (figure 2.7). The flux
partitioning between alternative pathways was generally very robust against genetic
perturbations and, for several pathways, completely independent of the absolute
flux through the branch point. The only detected branch point that featured any
significant flexibility in flux partitioning to different pathways was acetyl-CoA, the
entry substrate into the TCA cycle.

Figure 2.7 Relative fluxes through the pentose phosphate (PP) pathway, the TCA cycle,
and gluconeogenesis from oxaloacetate to PEP in 137 B. subtilis knockout mutants during
growth on glucose (Fischer and Sauer, 2005).


This control architecture of metabolism that maintains an unexpectedly stable 
metabolic state under a given environmental condition appears to be designed 
to provide a rigid flux distribution. While this state was robust against random 
genetic perturbations, it was sensitive to regulatory mutations because several 
regulator knockouts specifically affected flux partitioning at the acetyl-CoA branch 
point; that is, reduced the TCA cycle flux. The combination of high robustness
and suboptimal efficiency also illustrates the need for trade-offs between different
functional requirements.


2.4.2 Control Architectures in the Circadian Clock 


Circadian clocks provide endogenous cellular rhythms of approximately 24 hours 
that directly or indirectly control many physiological processes and have been 
observed in species across four kingdoms. At the molecular level, however, they show 
an apparently complex regulatory architecture with multiple intertwined positive 
and negative feedback loops. For the fly and the mouse, the cellular genetic networks 
contain delayed transcriptional feedback mechanisms (Hastings, 2000). The core 
of the heavily studied Drosophila transcriptional feedback network is shown in 
figure 2.8 (Hastings, 2000; Reppert and Weaver, 2000; Young and Kay, 2001). 

Figure 2.8 Core genetic network of the Drosophila circadian clock, adapted from
(Hastings, 2000; Reppert and Weaver, 2000; Young and Kay, 2001).

The transcription rates of the genes per (period) and tim (timeless) are accelerated
when protein dCLK binds to their promoter regions. The transcribed per
and tim mRNAs are exported from the nucleus and translated into proteins PER
and TIM, respectively. In the cytoplasm the protein DBT (doubletime) binds to 
PER. DBT either phosphorylates PER, causing it to be degraded, or allows PER to 
bind to TIM after a delay, thereby protecting it from degradation. After the DBT- 
PER-TIM complex is formed, it is imported into the nucleus where it represses the 
transcription of per and tim and activates the transcription of dClk (clock). The 
dClk mRNA is exported from the nucleus and translated into protein dCLK. Protein
dCLK is imported into the nucleus where it represses the transcription of dClk
and activates the transcription of per and tim. This system can be characterized
by a two-loop transcriptional feedback network, where DBT-PER-TIM negatively
feeds back on per and tim transcription and activates dClk transcription, and dCLK
negatively feeds back on dClk transcription and activates per and tim transcription.
In addition to the main (double) negative feedback loop, there are loops involving 
the genes vri and Pdp1ε. This multi-loop architecture is shared by mammals, although
some homologous proteins play different roles (Reppert and Weaver, 2000;
Leloup and Goldbeter, 2003; Forger and Peskin, 2003). 

Model-based analyses of these networks have pointed out their remarkable robust- 
ness in the presence of molecular noise (Barkai and Leibler, 2000; Ueda et al., 2001; 
Gonze et al., 2002) and with respect to parametric perturbations (Smolen et al., 
2001; Leloup and Goldbeter, 1999; Stelling et al., 2004a). Different models display 
model-specific robustness and fragility properties (Zak et al., 2001; Stelling et al., 
2004a). Employing tools from systems engineering, Stelling et al. (2004a) performed
a comparative analysis of the global robustness and fragility properties of two pub- 
lished mathematical models for the Drosophila circadian clock. Both deterministic 
models relied upon negative autoregulatory feedback for generating sustained oscillations.
A less complex 5-state model with only one branch (Goldbeter, 1995) and a
10-state model including two distinct branches of the control system for per and tim 
(Leloup and Goldbeter, 1999) were considered. To gain insight into the structure- 
function relationship, they studied robustness towards parametric perturbations by 
numerical computation of the parameter sensitivities (see chapter 1).

For the detailed analyses of both fly clock models, model parameters were organized
in functional categories (for example, transcription, translation, phosphorylation,
etc.), as well as into hierarchical categories. For the latter, global parameters
reflected characteristics of well regulated core cellular machineries (such as the 
maximal capacity of the general transcriptional apparatus embodied in maximal 
transcription rates for all genes), while local parameters were primarily confined to 
the circadian oscillator. The analysis revealed clearly that global parameters were 
more fragile, in comparison to the more robust local parameters. Furthermore the 
separation between the two was sharpened by the complex hierarchical organiza- 
tion underlying the fly dual feedback clock model (as opposed to the single feedback 
engineering model). In agreement with the bow tie proposition for cellular organiza- 
tion (section 2.2.3), these results suggest a design principle of cellular regulation, in 
which robustness of specific (local) functions is achieved by delegation of fragilities 
to global control circuits (Stelling et al., 2004a) (figure 2.9). The same trade-offs 
are observed in the mammalian clock architecture (F. Doyle, unpublished results).

Figure 2.9 Scenarios for distribution of robustness and fragility. (A) Concentration of
fragilities in a central core, whereas functional modules are error tolerant; gray levels
correspond to levels of fragility. (B) Equal distribution of fragilities.


One important consideration in the analysis of robustness, particularly with 
regard to the circadian rhythm circuit, is the selection of a performance attribute 
for evaluation of its robustness characteristics. For example, Stelling et al. (2004a) 
compared the rank ordering of sensitivities (robustness) for both the period of the
proteins and transcripts and their amplitude. Not surprisingly, the order changes
significantly, with transcriptional/translational regulation having a larger impact on
amplitude, while phosphorylation/dephosphorylation has a larger impact on the
clock’s period. Additional attributes may be considered for robustness, including 
entrainment, phase response sensitivity, and relative phase timing of key proteins. In 
general, the conclusions drawn about robustness may vary as different attributes are 
evaluated. Moreover, the scale of the network analysis may influence the conclusions: 
single-cell attributes are likely to be quite different from whole organism robustness. 
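The dependence of sensitivity rankings on the chosen performance attribute can be sketched with a much simpler oscillator than a clock model. The example below is an assumption made purely for illustration: an undamped mass-spring oscillator with hypothetical parameters m, k, and v0, for which period and amplitude are known in closed form. Relative sensitivities are estimated numerically by central differences on a logarithmic scale.

```python
import math


def period(p):
    """Period of an undamped mass-spring oscillator."""
    return 2 * math.pi * math.sqrt(p["m"] / p["k"])


def amplitude(p):
    """Amplitude when the mass starts at the rest position with speed v0."""
    return p["v0"] * math.sqrt(p["m"] / p["k"])


def relative_sensitivity(attribute, params, name, rel_step=1e-4):
    """Central-difference estimate of d ln(attribute) / d ln(parameter)."""
    up = dict(params, **{name: params[name] * (1 + rel_step)})
    down = dict(params, **{name: params[name] * (1 - rel_step)})
    d_log_attr = math.log(attribute(up)) - math.log(attribute(down))
    d_log_param = math.log(1 + rel_step) - math.log(1 - rel_step)
    return d_log_attr / d_log_param


params = {"m": 2.0, "k": 5.0, "v0": 0.3}   # hypothetical values
for attribute in (period, amplitude):
    sens = {n: relative_sensitivity(attribute, params, n) for n in params}
    print(attribute.__name__, {n: round(s, 3) for n, s in sens.items()})
```

In this toy case the initial speed v0 has no influence on the period but is the most influential parameter for the amplitude, so a ranking of parameters by sensitivity depends on which attribute is evaluated.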


2.5 Consequences for Systems Modeling


In this chapter, we focused on rather abstract concepts that deal with cellular com- 
plexity, function and structure. As the biological examples in the previous section 
showed, such concepts provide an organizational framework for modeling—and fi- 
nally understanding—cellular systems. In particular, insight into the robustness 
of cellular networks can guide us in what and how to model for such systems. 
In general, given the small repertoire of mechanisms providing robustness, mod- 
eling could specifically probe these mechanisms; the analysis of bacterial central 
metabolism suggests that network properties rather than redundancy of individual 
components should be in the focus of such efforts. The circadian clock examples 
revealed structuring of sensitivities in agreement with the predictions made by the 
bow tie hypothesis. For model development, such features help in identifying the 
important/less important parts of the system. For instance, they would suggest
devoting more efforts to detailed modeling of the core processes because these are 
likely to be highly sensitive to design flaws in the models. In addition, characteristic 
distributions of robustness and fragility can be exploited to decompose larger net- 
works into manageable subunits. Such an approach proved successful, for instance, 
for the analysis of signal transduction processes in apoptosis (Bentele et al., 2004). 

More generally, robustness may facilitate model development because exact pa- 
rameter values are not required in many instances, and sensitive parameters could 
possibly be predicted from network architectures. In other words, robustness im- 
plies an importance of accurately describing the structure of a system as opposed 
to identification of the associated parameters. A classical study on the segment 
polarity network in Drosophila revealed that without the appropriate (feedback) 
structures, despite large freedom in choosing parameter values, even the qualita- 
tive behavior of this developmental control circuit could not be reproduced (von 
Dassow et al., 2000). For practical purposes, the importance of network structures 
may imply that transferring quantitative models between similar systems, such
as different cell lines from one organism, might reduce to adjusting a few key pa- 
rameters. These are some justifications for why qualitative and structural modeling 
methods—although devoid of parameters—may yield deep insight into the relations 
between network structure and behavior (see chapter 7 and chapter 5). Given our 
incomplete knowledge on cellular circuits, moreover, we often face the challenge of 
evaluating sets of different hypotheses on cellular network structures. The robust- 
ness property allows for relatively easy discrimination between hypotheses because 
exact parameter values are not important in many cases. For instance, the models’ 
ability to perform robust control tasks could be used in elucidating network struc- 
tures underlying morphogen gradients in embryonic patterning (Eldar et al., 2002, 
2003). In the model world, thus, robustness could be employed as one criterion for 
assessing the plausibility of a particular model (Morohashi et al., 2002). However, 
we always have to be aware that prior knowledge is essential for determining the 
exact nature of robustness for a system.

Finally, representation of a cellular network through a mathematical model is
always only the first step towards understanding—subsequently we have to ask 
why the models perform as they do, and which are the underlying design princi- 
ples (Lander, 2004). The analysis of robustness properties can lead to abstractions 
in this direction, for instance, by revealing a common operating principle in bac- 
terial chemotaxis despite different molecular implementations in different organ- 
isms (Rao et al., 2004) (see also chapter 12). From engineering, it is known that 
feedback control (plus feedforward control) enabled by fast and if possible remote 
advanced-warning sensing is the most powerful mechanism for providing robust- 
ness to fluctuations in the environment and the component parts. The heat-shock 
response in E. coli appears to employ exactly the same principles as shown by de- 
tailed modeling and subsequent model reduction to the core elements (El-Samad 
et al., 2005). Future studies can make use of such principles by searching for this 
type of mechanism. Hence, abstract concepts on complexity and robustness have 
broad implications both for systems modeling and systems analysis. 





2.6 Concluding Remarks 


In this chapter, we adopted a high-level view of cellular systems by combining 
biology and engineering approaches. This perspective is not meant to disguise
large differences between the two types of systems; in fact, biology often shows a 
more remarkable “design” than technology. However, it appears as though there are 
universal principles in biology and technology that facilitate robustness, efficiency, 
and evolvability. We do not yet have a clear and concise characterization of them all, 
but we can say some things: (i) feedback control is the most powerful mechanism 
for providing robustness to fluctuations in the environment and the component 
parts; (ii) redundancy plays some role in robustness to component variations and 
failures but is most useful when it is part of feedback control systems that can 
sense variations and failures, and coordinate the use of multiple resources; (iii) 
protocols that enable carrier- and building-block-based metabolism facilitate both
decentralized control and supply chain management for short term fluctuations as 
well as plug-and-play modularity for long term evolution. Taking such abstractions 
into account for systems analysis in biology—as several examples showed—can 
provide the necessary guidelines for modeling and analyzing biological complexity. 
In our view, the close relations between complexity and robustness requirements 
may imply that living cells are complicated, yet comprehensible systems. 


3 On Modules and Modularity 


Zoltan Szallasi, Vipul Periwal, and Jörg Stelling 


The enormous complexity of biological systems begs for unifying, simplifying con- 
cepts that might allow a predictive understanding of their functioning. Suggestions 
for such concepts include “modularity” along with robustness, discussed in the pre- 
vious chapter. No comprehensive survey of system modeling can ignore these con- 
cepts, even if it means pointing out the lack of consistent and clear definitions in 
the field. Modularity is without doubt an enticing concept that may hold promise 
for helping to overcome some of the computational limitations of dynamic modeling 
of biological systems. The list of what cannot be achieved far exceeds the utility 
of this concept as demonstrated thus far. As this chapter outlines, the long and 
arduous task of laying its rigorous quantitative foundations is in its infancy. 





3.1 Introduction 


Biological systems are often said to be “complex.” Is this a precise logical concept, 
in the sense that given a set of systems we can unambiguously separate the complex 
systems from the simple ones, or is this merely an adjective assigned on the basis of 
the user’s inability to comprehend the relation between inputs and outputs of the 
system? In the quantitative science literature, a more or less standard definition 
exists: A complex system is a system whose properties are not fully explained 
by an understanding of its component parts. Complex systems consist of a large 
number of mutually interacting and interwoven parts, entities, or agents. As is 
evident from this standard definition, there is considerable ambiguity implicit in its 
negative character. A variant of this definition, interesting for biology, posits that 
both understanding and verification of design and/or function is difficult in complex 
systems. 

As a counterpoint, modeling a biological system is an exercise in understanding 
how the outputs arise from the inputs. In light of the preceding definition, we 
might then suppose that biological modeling is the process of moving systems from
the complex systems set to the simple systems set. A natural notion in modeling
complex systems is to replace some of the parts being modeled with an abstraction, 
while maintaining the fidelity of the model with the given experimental data. 
This requires, usually, the maintenance of the functional interface of the replaced 
parts with the rest of the system. For example, the engine in an automobile is an 
abstraction for a large number of components. The functional interface with the 
rest of the automobile is provided by the drive shaft and various hoses and wiring 
harnesses. Such functional abstractions are often called modules. 

Almost all artifacts of evolved human engineering are modular through and 
through: their entire architecture is composed of parts packaged within bigger parts 
with clear functional interfaces. This is true of electronics, it is true of houses, and 
it is true of software. In these systems, all parts are members of some module, and 
the entire architecture is modular. It would seem, based on this experience, that 
the way forward in simplifying the complex biochemistry of life is to encapsulate 
complexity in similar modules. Certainly, the computational limitations of dynamic 
gene network modeling are much easier to evade and an understanding of complex 
networks in terms of (higher-level) functional interactions is easier to achieve if 
a modular architecture underlies the network. The question at hand is: To what 
extent does modularity provide realistic and useful abstractions for systems shaped 
by biological evolution? 

There are two separate issues here — the existence of modules in biology, and 
the utility of this concept, although it will most likely be an approximation of 
reality (as in any abstract model). Regarding the existence of modules, even in the 
engineering example, autonomy is never absolute. We therefore have to consider 
subsystems of limited (quasi-)autonomy. One subissue is whether we can identify 
modules in the biochemical interaction network that work quasi-autonomously, like 
the engine in our analogy. The other subissue is the existence of an overall modular 
architecture for the entire biochemical interaction network. The modules evident in 
the morphological structures present in each eukaryotic cell, such as the nucleus, 
the mitochondria, and other organelles, as well as the presence of specialized organs 
in metazoans are certainly evidence of a modular architecture at a high level. In 
this sense, the success of organ transplants can be considered as taking advantage 
of the existence of modules in the human body for medical intervention. Close to 
fundamental biochemistry, biological concepts of a “gene,” a “protein,” or a “protein 
domain” are widely employed abstractions from the underlying chemistry. In the 
context of interaction networks, for instance, protein functions are usually not 
discussed in terms of the protein’s atomic coordinates. A given protein is thus 
considered to be a module of all of its constituent atoms. Hence, at these two very 
separate levels of cell biology we find evidence for modularity. But does this hold 
for modularity at all levels in biochemistry?





3.2 The Concept of Modules in Other Biological Disciplines


Evolutionary biologists have long considered a type of modularity in which the 
animal body is composed of units, which integrate functionally related characters 
(with characters in genetics defined as structures, functions, or attributes deter- 
mined by a gene or group of genes) into units of evolutionary transformation (Wag- 
ner, 1996). They have also investigated extensively the origins of this modularity, 
either from evolution or from a priori principles of organization for reproducing 
systems. In terms of the evolutionary origins of modularity, modularity could arise 
from specialization resulting in the elimination of some pleiotropic effects from a 
more integrated phylogenetically primitive state, that is, a bigger module splitting 
into two more independent submodules or from the opposite process when a linked 
functional role leads to differentially greater integration of evolutionary characters, 
preventing independent variation. These two processes are acting in independent, 
and potentially opposing, directions. Thus an evolving system never exhibits per- 
fect nonoverlapping modularity, just as a matter of simple irreversible statistical 
mechanical relaxation. 

The presence of modules may well enhance the rate of evolution due to noninter- 
ference between functional roles, though this mechanism is unlikely to be of interest 
in multilocus systems because it is hard to maintain the necessary level of linkage 
disequilibrium in multilocus systems (Wagner, 1996). Stabilizing selection, likely the 
mode of selection experienced most of the time, is blind to modular organization in 
systems with multiple characters, neither enhancing nor washing out distinctions. 
Directional selection forces, on the other hand, may result in the adaptation of a 
small number of linked characters, preserving other characters under the influence 
of stabilizing selection. Pleiotropic effects may interfere with adaptation, perhaps 
leading to mutations that decrease pleiotropic effects linking the genes associated 
with the adapting characters to other characters, and thereby leading to the ap- 
pearance of modules. Thus evolutionary biology favors the appearance of modules, 
but not necessarily modularity in the overall organization. It is also apparent in 
the argument for the appearance of modules that the environmental circumstances 
which favored the decrease in pleiotropic effects are integral to the definition of 
these modules. Therefore, the response of a biological system may reflect the exis- 
tence of certain modules only in specific contexts. This is not necessarily the case 
for modules in human engineering constructs. 

Modularity has also been investigated extensively in neurobiology. In fact, the 
notion has been considered independently in the three fields of psychology, neu- 
roscience, and artificial intelligence, which can be regarded as the neurobiology 
analogues of physiology, molecular biology, and biological modeling in systems bi- 
ology, respectively. It is instructive to note that workers in each of these fields have 
their own definitions of modularity (Bryson, 2005). Reconciling these definitions is 
an important part of understanding the actual behavior of organisms, and it is just 
as likely that modules found in cellular physiology are intricately related to modules
found in molecular biology. From a physiological point of view, modularity might
be considered in the form of the hypothesis that the cell contains independent input 
systems that are restricted in the range of environmental and cell-state information 
that they can access. 





3.3 The Concept of Modularity in Systems Biology 


The interest in modules in the systems biology context was expressed clearly
by Hartwell et al. (1999), albeit mainly invoking a hypothetical parallel between
human and evolutionary design and providing little in the way of evidence. As is 
evident from other disciplines interested in biological modules, there is a lack of well- 
defined, quantitatively applicable definitions. One reason for this lack is that the 
concept of “modular design” is borrowed from human engineering and therefore has 
an essentially forward looking, goal-oriented nature. Complex engines and networks 
are constructed from modules while the final overall behavior of the system is kept 
in mind. It is much more difficult to identify a “modular architecture” in an already 
existing complex network, such as a cell, especially in an unsupervised fashion. 
Overlapping modules and multiple “hidden” or ill-defined functions of subsystems 
pose additional, potentially insurmountable, difficulties (see also Figure 3.1 and 
text below). We would encounter similar difficulties in human designed systems 
if we were only presented with the results without an appropriate understanding 
of the functions. Advanced engineered systems are rather frequently modular in 
their overall design, but for evolved systems we do not even have the appropriate 
analytical tools to address the issue of modular decomposition. 

As a consequence, most studies on modularity in systems biology rely on oper- 
ational definitions that reflect to a large extent the biological system or data set 
from which the modules were extracted, as well as data quality and the available 
computational tools. The abstraction of a “module” will always be an approxima- 
tion to reality. This already holds for the concept of a gene and — as discussed above 
— this will be true to an even larger extent for complex cellular networks. Hence, 
these operational definitions should be judged by their value in facilitating the
development of dynamic models and by the extent to which they enhance our understanding of
these systems. Two extremes in definitions and analysis of modularity can be found 
in “bottom-up” and “top-down” approaches (see chapter 1) that we will discuss in 
the following. 


3.3.1 Bottom-up approaches 


Bottom-up approaches to a large extent build on existing biological knowledge. 
The function of the proposed module is well defined (at least under a limited set of 
conditions) and the individual members of the module are determined by detailed 
biochemical or molecular biological analysis, such as testing the effect of individual 
gene knock-outs on the function in question. In an ideal case the dynamic
interactions between the various components in a module are also known. This allows
the validation of the proposed module in a dynamic context. If the quantitative 
behavior of this module, when studied in relative isolation from the rest of the 
entire intracellular regulatory network, provides an accurate and comprehensive 
description of the specific function in question, then the proposed module can be 
considered validated. Hence, the approach essentially constitutes a direct test of 
the “quasi-autonomy” that is characteristic of most definitions of a module. 

An excellent example of this approach is given by von Dassow and colleagues (von 
Dassow et al., 2000). In their paper on the quantitative analysis of the segment po- 
larity network of Drosophila, they first defined a module as a quasi-autonomous 
subsystem of a complex genetic circuit with a specific function. Their proposed 
module was built on a large body of knowledge of Drosophila differentiation from 
which they created a dynamic mathematical representation. With a few subsequent 
corrections and modifications, this modular representation robustly reproduced the 
qualitative in vivo dynamics of the specific differentiation process in question. More- 
over, its predictions were consistent with a wide array of experimental observations. 
Note that the components of the module participate in other cellular processes as 
well, so the modular character of the subsystem is specific to the process being 
modeled. 

It is evident from the description above that bottom-up approaches require
considerable effort in terms of assembling and validating modules. High-throughput,
computationally aided efforts for module identification based on large data sets,
therefore, hold considerable appeal. The rationale of such approaches may be motivated
by the following analogy: In the classical age of genetics, genes were tradition- 
ally identified by individually sequencing DNA fragments of limited size that were 
isolated based on the fact that the nucleotide sequence in question had some func- 
tional relevance in biological experiments. More recently, however, entire genomes 
have been sequenced in a wholesale fashion, and genes have to be extracted from 
a deluge of sequence information, often resulting in erroneous gene identification. 
Furthermore, in most cases the function of the putative genes has to be determined 
a posteriori. In other words, in the first case a nucleotide sequence is matched to a 
function of interest, and in the second case functions have to be found for existing 
sequences. 


3.3.2 Top-down Approaches 


Fueled by the overall accessibility of genome scale data sets, several top-down ap- 
proaches have been proposed for the high throughput identification of putative 
modules. These methods usually rely on the concept that intramodular “connec- 
tions” (whatever they may be) are more frequent than intermodular ones. The 
underlying assumption is that the number of interactions provides an indicator 
for how well embedded a certain component is into a subsystem; most approaches 
do not consider the “strength” of these interactions. Graph theoretical approaches,
for instance, statically represent components as nodes and interactions as edges
between these nodes (see chapter 7) (Figure 3.1).

Figure 3.1 Identifying one or more modules in a genetic regulatory network versus
modular organization of the entire genetic network. The figures show a static graph
representation of two genetic regulatory networks. The circles represent various proteins
or genes whereas the edges represent regulatory interactions. In the genetic network of
panel A, a module (circled) can be identified by the following criterion: each member of the
module is connected to at least two other members of the same module. Note that no other
module can be identified in this network. In the genetic network of panel B, a far larger
number of modules are present. Note the overlapping modules marked by stars. Since the
majority of proteins and genes can be included into one or more modules, a certain level
of modular organization is apparent.

Graph representations of large-scale biological data sets are especially attractive 
targets for analysis of modularity because these simple representations can be 
analyzed for very large networks. In one study, von Mering and coworkers performed 
whole genome bioinformatics analysis of protein interaction networks (von Mering 
et al., 2003). In their work a functional module is a tight cluster of proteins in 
the protein interaction network. A similar approach was followed by Spirin and 
colleagues (Spirin and Mirny, 2003) while studying protein complexes in molecular 
networks: molecular modules were defined as sets of proteins that have more 
interactions amongst members of the set than with the rest of the protein interaction 
network. The logical end point of these static approaches to the identification of 
modules is in the definition given by Guimera and colleagues (Guimera et al., 2004), 
with modules assigned by a partitioning of the nodes in an interaction graph that 
maximizes a modularity cost function defined entirely in graph-theoretical terms 
(intramodule versus intermodule links in the graph, the sum of the node degrees 
within a module). 
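
The partition-based definition can be illustrated with a minimal sketch. It assumes the commonly used form of the modularity function, Q = sum over modules s of [l_s/L - (d_s/2L)^2], where L is the total number of links, l_s the number of links inside module s, and d_s the sum of the node degrees in module s. The function name and the toy graph (two triangles joined by a single link) are illustrative assumptions, not taken from the cited studies.

```python
def modularity(edges, partition):
    """Q = sum_s [ l_s/L - (d_s/(2L))**2 ] for an undirected, unweighted graph.

    edges     -- list of (node, node) pairs, each link listed once
    partition -- list of sets of nodes; every node belongs to exactly one set
    """
    total_links = len(edges)
    module_of = {node: idx for idx, module in enumerate(partition) for node in module}
    links_inside = [0] * len(partition)   # l_s: links with both ends in module s
    degree_sum = [0] * len(partition)     # d_s: sum of node degrees in module s
    for u, v in edges:
        degree_sum[module_of[u]] += 1
        degree_sum[module_of[v]] += 1
        if module_of[u] == module_of[v]:
            links_inside[module_of[u]] += 1
    return sum(l / total_links - (d / (2 * total_links)) ** 2
               for l, d in zip(links_inside, degree_sum))


# Hypothetical toy network: two triangles connected by the single link c-d.
edges = [("a", "b"), ("b", "c"), ("a", "c"),
         ("d", "e"), ("e", "f"), ("d", "f"),
         ("c", "d")]

q_two = modularity(edges, [{"a", "b", "c"}, {"d", "e", "f"}])   # one module per triangle
q_one = modularity(edges, [{"a", "b", "c", "d", "e", "f"}])      # everything in one module
```

Splitting the toy graph into its two triangles gives a higher Q than lumping all nodes together, which is the sense in which a partition "maximizes" modularity. A search over partitions, which is the computationally hard part, is deliberately left out of this sketch.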

In addition to direct physical interactions, modular connections may reflect 
regulatory relationships, such as shared regulatory inputs. A regulatory module, 
therefore, can be defined as a set of genes that are regulated in concert by a 
shared regulatory program that governs their behavior (Segal et al., 2003). Both 
the behavior and the modules assigned through analysis of the behavior may be 
dynamic and overlapping. Similarly, a transcriptional module may be defined as a 
self-consistent regulatory unit consisting of a set of coregulated genes as well as the 
experimental conditions that induce their coregulation, with modules decomposing 
into higher-resolution modules when a resolution parameter is varied (Ihmels et al.,
2002, 2004a). 

Several points should be emphasized in connection with the top-down approaches
described above:


1. They produce only putative modules; their relevance has to be validated in a 
detailed manner, as done for the bottom-up approaches. This is especially true for
methods relying on static graph representations, such as representations of protein- 
protein interaction networks (Spirin and Mirny, 2003). This is also true for modules 
extracted from time-course data, although, depending on the manner in which they 
were defined, such modules may carry over to dynamic modules more readily. 


2. Finding a large number of putative modules in a high-throughput analysis does 
not automatically translate into the modular organization of an intracellular
network. One reason is that most high-throughput studies consider only one level of
biological regulation (for example, transcriptional control or protein-protein interactions).


3. Top-down approaches may often produce putative modules without a well- 
defined associated function, without which a reliable validation is significantly more 
difficult. 


4. The limitations of identifying modules by the top-down approaches described above
are evident. For example, a more or less linear signal transduction pathway
will not show dense intramodular connectivity in protein interaction networks,
and will therefore be missed by these methods. 





3.4 Definition of Modules for Dynamic Networks 


The bottom-up method for module identification described above produces modules
that could be, at least in principle, readily incorporated into dynamic network 
models — with the caveat that a dynamic model is needed as a prerequisite for the 
identification of a module. This would involve replacing the detailed dynamic model 
of the given module with a simpler system that would still correctly characterize the 
dynamic behavior of the associated function and also provide a sufficiently accurate 
description of the dynamic interactions between the module and its functional 
environment. Ideally, it would also allow deducing higher order cellular functions by 
combinations of modules. However, this module identification method, in addition 
to its essentially low throughput nature, comes with several caveats. It relies on 
the existence of a biologically interpretable “function,” and it barely takes into 
consideration that the module is sitting in the middle of, and has to be extracted 
from, a complex dynamic network. Modules for cellular level dynamic network 
modeling, however, are expected to satisfy other criteria. The main goals are: 


1. The modules should provide a significant level of abstraction, aiding in the 
simplification of an otherwise barely tractable dynamic network. 


2. The various functions of the entire biological network are expected to be
described even if the individual “dynamic network modules” cannot be associated
with an easily interpretable or observable function, such as a given differentiation 
pattern. 


The biological function is thus approximated and replaced by an appropriate 
“mathematical function.” 

At a level of low complexity — that is, for small modules comprising few compo- 
nents and interactions — biochemical “building blocks” that perform (a small num- 
ber of) characteristic dynamic functions can be identified. For instance, the graph 
theoretical analysis of transcriptional regulation networks in E. coli (Milo et al., 
2002; Shen-Orr et al., 2002) and budding yeast (Lee et al., 2002) identified small 
over-represented “motifs” that can be attributed distinct functions, such as filtering 
noise or speeding up transcriptional responses in the case of the “feedforward mo- 
tif” (Mangan and Alon, 2003) (see also chapter 7). From a theoretical perspective, 
one can determine small “building blocks” that are required for obtaining classes of 
dynamic behavior such as adaptation, homeostasis, switching, and oscillations (see 
chapter 6 for details). 
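
As a hedged illustration of the noise-filtering function attributed to the feedforward motif, the sketch below simulates a coherent feedforward loop in which an input X activates Y, and the output Z is produced only while both X and Y are present (AND logic). The rate constants, the threshold, and the simple Euler integration are assumptions chosen for readability, not parameters from the cited studies.

```python
def simulate_ffl(pulse_duration, t_end=10.0, dt=0.001, k_y=0.5):
    """Euler integration of a coherent feedforward loop with AND logic.

    X is a square input pulse lasting `pulse_duration` time units.
    Y is produced while X is on; Z is produced only while X is on AND Y > k_y.
    Production and decay rates are all set to 1 (an illustrative assumption).
    Returns the maximum level reached by the output Z.
    """
    y = z = z_max = 0.0
    for step in range(int(round(t_end / dt))):
        t = step * dt
        x_on = t < pulse_duration
        dy = (1.0 if x_on else 0.0) - y
        dz = (1.0 if (x_on and y > k_y) else 0.0) - z
        y += dt * dy
        z += dt * dz
        z_max = max(z_max, z)
    return z_max


z_short = simulate_ffl(pulse_duration=0.3)   # brief fluctuation of the input
z_long = simulate_ffl(pulse_duration=5.0)    # sustained input signal
```

With these assumed parameters, Y never reaches its threshold during the brief pulse, so Z is not produced at all, whereas the sustained input drives Z close to its maximal level after a delay. The motif therefore responds to persistent signals while ignoring short input fluctuations.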

Both approaches, however, are limited by the network size that can be assigned a 
distinct function. For instance, cellular circuits rarely employ one prototypic device 
to establish a biological function because robustness and efficiency of the function 
usually need additional complexity, for example, in the form of interwoven feedback 
circuits (see chapter 2 for examples). Hence, while the abstraction of “motifs” can 
provide insight into constituents of a “functional” module (in terms of biology), 
in general it is sufficient for neither the dynamic analysis nor the definition of a 
module. 

An interesting “middle-out” approach has been proposed recently by El-Samad
et al. (2005). They studied the heat shock response system in
E. coli, which as a first step involved developing a detailed mechanistic model 
for the entire system as defined by the traditional biological definition of the 
module, taking into account individual proteins and their interactions. In a top- 
down manner within this module, the authors have performed a systematic model 
reduction, and they have proposed the existence of certain functional submodules 
based on characteristics of the overall behavior of the entire system, such as 
robustness (see chapter 2) or optimal performance. This approach closely follows the 
method of modular decomposition routinely used in system engineering, namely the 
identification of submodules or devices based on their dynamic functions. Although 
the computational analysis suggests intriguing insight and circumstantial evidence 
for the proposed overall design and modular structure of the heat shock response 
system, experimental assignment of the various proteins to the various submodules 
and their functional validation remains to be performed. 

In a similar spirit, Kholodenko and coauthors proposed methods for the modular 
analysis of complex (signaling) networks, in particular with respect to the quan- 
titative identification of network topologies (Bruggeman et al., 2002; Kholodenko 
et al., 2002). The system to be analyzed is a given network, the boundaries of which
are determined, for example, by a biological function or according to the notion of a 
traditional signaling pathway. At a lower level, however, details of this network need 
not be resolved. Instead, operational modules are the subject of analysis. Responses 
of the modular system in steady state to perturbations are then described through 
interactions between the modules alone in order to quantitatively analyze signal 
transfer through the entire network (Bruggeman et al., 2002). Similarly, the ab- 
straction of modules can be employed for identifying networks of (partly) unknown 
structure through perturbations affecting one module at a time (Kholodenko et al., 
2002). Apparently, the definition of modules proceeds first irrespective of whether 
these modules correspond to a biological function. However, this abstraction po- 
tentially enables one to develop the dynamic models that are required for a more 
unambiguous definition of modules. Note that this necessarily involves an iterative 
process — a hallmark of systems biology in general (see chapter 1). 

In addition, the middle-out approaches discussed above also provide some guid- 
ance in determining whether a given dynamic intracellular regulatory network is 
modularly organized: if a wide variety of higher level functions of the entire dy- 
namic network can be comprehensively and accurately characterized by replacing 
the majority of individual genes and proteins by a significantly smaller number of 
dynamic modules, then a modular organization is likely to exist. 

The reader will notice gaps and tensions in this section on the identification of 
dynamic modules in biology. Current approaches cannot cover the huge gap between 
the levels of a few interacting components and of the cell as a whole. Tensions are 
evident because a proper definition of biological modules would require dynamic 
models, for the development of which focusing on a small part of the cellular 
networks is necessary — a classic catch-22. There are algorithmically implementable 
definitions available — see the top-down approaches using graph theory discussed 
above or the mathematically well-defined concept of metabolic pathways discussed 
in chapter 5 — but it is largely unclear if these definitions have any relevance for 
biological functions (which by themselves often require a more rigorous definition). 
Hence, for this field as for systems modeling in biology in general, only iterative 
processes may ultimately lead to a framework of methods by which parts of large 
dynamic networks could be collapsed into and replaced by relatively simple modules. 
A point to ponder, illustrating the challenges ahead, is that in the modeling domain 
in general there is no universal recipe for the task of model reduction. 





3.5 Conclusions 


Abstractions, such as modules, are required for analyzing complex systems, but they 
have obvious limitations, more often recognized through failure than by foresight.
Studying individual modules, especially those identified by bottom-up approaches, 
is appealing. Through such studies, one can make predictions and design and test 
desired changes in biological functions. Various approaches along these lines are 
documented throughout this book. The behavior of biologically existing modules is
studied, for example, in chapter 6, and human designed modules, with potential 
biotechnological consequences, are described in chapter 13. As biologists have 
identified an increasing number of genes associated with functions, the study of 
individual modules has started in earnest, reaching the level of quantitative dynamic 
approaches during the past couple of years. Models of Ran nucleocytoplasmic
transport (Smith et al., 2002) and the EGF receptor pathway (Schoeberl et al.,
2002) provided an accurate, predictive description of their respective modules. 
However, these modules have not been coupled to others in order to attempt a 
higher-level integration of cellular functions. Therefore, the integration of dynamic 
functional modules as well as their rigorous definition and identification remain to 
be investigated. 





II MODELING APPROACHES 


4 Bayesian Inference of Biological Systems:
The Logic of Biology 


Vipul Periwal 


Systematic model selection and inference in modeling biological systems must deal 
with the specific problems of incomplete prior knowledge, limited heterogeneous 
data, and similar but not identical model systems. In addition, the model selection 
process must allow incremental updates as new data becomes available. Probability 
theory as embodied in Bayes’ theorem is the unique logically consistent framework 
for such reasoning. The foundations of Bayesian inference are summarized with 
some excursions into information theory and search theory. Some examples
taken from the recent literature are reviewed.





4.1 Introduction 


Reasoning in biology imposes three general desiderata on the reasoning process: 


1. We must reason with incomplete prior knowledge of and limited data on the 
biological system under study. For example, we may have microarray or proteomics 
data with little knowledge of cellular localization. 


2. We must be able to update our inferences taking into account new data, without 
having to revisit the entire reasoning process. For example, we should be able to 
add mass-spectroscopy data to our inference based on expression data. 


3. We must be able to combine observations in multiple model systems, with no 
sense in which the different systems are merely repetitions. For example, we should 
be able to use knowledge of expression levels in a pathway in different bacteria to 
make more trustworthy inferences of aspects of the regulation of the pathway, even 
though there is no sense in which the observations are repetitions. 


The mathematical rules of probability theory (Jaynes, 2003; D’Agostini, 2003) are 
the unique consistent rules for conducting plausible reasoning in such a setting. It 
must be emphasized here that, given partial observability and incomplete knowledge
characteristic of biological systems, all probabilities may be considered conditional 
probabilities, especially conditioned on the state of knowledge of the biologist. 
The mathematical rules of probability theory, applied consistently, will lead to a 
consistent and optimal revision of the probabilities in light of new evidence. Given 
adequate data and possibly quite different initial assignments of probabilities, two 
different experimenters will usually arrive at convergent inferences, provided that 
the rules of probability theory are consistently applied. 

The aim of this chapter is to explain the basics of plausible reasoning relevant 
for systems biology. As with the rest of this book, the foundational material 
presented here is intended to facilitate understanding between scientists with 
different backgrounds and to allow workers access to more specialized tracts with 
a basic understanding of the issues. The application of plausible reasoning to 
biological systems is not a novel idea and is well-documented in the medical 
literature (Lusted, 1968). The same desiderata given above in the systems biology 
context also apply to medicine, so this should not come as a surprise. Probabilistic 
reasoning has also been applied to systems biology in many papers, under the 
terminology Bayesian networks or graphical models (Jordan, 1998; Pearl, 2000). 
This chapter is intended to provide a foundational perspective on the logic that 
underlies such applications. 

Notation: A proposition is a statement that may be true or false, for example 
A = “The upregulation of Erbb1 leads to increased expression of β-catenin.” The
term probability is used in this chapter in the sense of a quantitative assignment 
of a degree of plausibility to a proposition. Clearly such a probability has nothing 
logically to do with the number of times a proposition is observed to hold in a 
repetition of an experiment. p(A|B) is the probability that proposition A is true, 
given that proposition B is true. The negation of a proposition A is denoted $\bar{A}$. The
proposition “A and B” is denoted AB, and the proposition “A or B” is denoted 
A+B. 

A sampling distribution or likelihood function is a rule for assigning probabilities 
to data, given that a hypothesis is true, p(data|hypothesis). Thus, sampling theory 
is concerned prototypically with problems of the form: given the contents of a cell, 
determine the probabilities of drawing a certain set of messages. Scientific inference, 
on the other hand, is concerned with problems of the form: knowing the observed 
expression data, determine the contents of the cell. Much of this chapter is devoted 
to this inverse problem: how do we calculate p(hypothesis|data)? 

In the most general context, the biological system under investigation may be 
characterized by a set of unobserved and unknown variables, X, for example the 
phase of the cell cycle, and localization and concentration information for a large set 
of proteins, but the available experimental data, D, may be only a few protein and 
mRNA measurements. We expect a functional relationship of the form D = F(X), 
and we hope to extract X from D by inverting F: $X = F^{-1}(D)$, but typically the
data is not sufficient to allow this inversion. Biological systems are never completely 
observed experimentally. Thus experiments exhibit variability that is often termed 
“noise”, as a short-hand for uncontrollable effects. Biological systems often exhibit
fundamental stochasticity too in their mechanisms of action, but this stochasticity 
and the former noise are completely unrelated to the probabilities that we are 
concerned with here. Finally, there is the randomness associated with experimental 
protocols, for example small differences in aliquots of mRNA extract, which for the 
purposes of inference is in the same category as the other unknown variables in X. 

We may have useful information relating D and X in the form of likelihood 
functions, which give us the probability of observing D given a certain set of values 
X. For example, suppose we observe a certain level of the phosphorylated form
$P_\phi$ of a protein P. Using the Heaviside function $\Theta(x)$, which vanishes for x < 0 and
is equal to 1 for x > 0, our prior probability for $P_\phi$ given P is

\[ p(P_\phi \mid P) = \Theta(P - P_\phi)/P, \tag{4.1} \]

reflecting the probability that, given the total level of P, the observed level of $P_\phi$
must be less than P. Suppose we know that only the phosphorylated form $P_\phi$ is
stable, and that the concentration of the message corresponding to P is M. The set
of reactions is, to the best of our knowledge,

\[ \begin{array}{ccccc} M & \rightarrow & P & \rightarrow & P_\phi \\ \downarrow & & \downarrow & & \\ D_M & & D & & \end{array} \tag{4.2} \]

where $D_M$ and D are the products of other reactions/decays of M and P,
respectively. What would be a plausible prior probability for $p(P \mid P_\phi, M)$? Knowing
nothing else, we could start with

\[ p(P \mid P_\phi, M) = \Theta(P - P_\phi)\,\Theta(30M - P)/(30M - P_\phi), \tag{4.3} \]

which quantitatively expresses our expectation that P must be higher than $P_\phi$
and lower than 30M at equilibrium, and incorporates our knowledge that P is
ubiquitinated in the unphosphorylated state. The factor 30 might reflect our
ignorance of the precise translational control of M, based on a review of the 
literature. These different functional forms for the likelihood function are reflections 
of our differing biological knowledge in the two cases: Message does not necessarily 
get translated into protein, but protein does not get made without message being 
expressed. These examples may be too simple to be of practical value, but the key 
point is central: any quantitative model or hypothesis linking unobserved quantities 
and observed quantities can be translated into a likelihood function. In a sense, this 
is implicit in the whole idea of a quantitative model and gives the data a meaning 
in the context of the system under study. 
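
To make the translation from biological statement to probability concrete, equation (4.3) can be written as a small function. This is a minimal sketch: the function and variable names, the value assigned to the Heaviside function at exactly zero, and the numerical values used in the check are all illustrative assumptions.

```python
def heaviside(x):
    """Step function: 0 for x < 0, 1 for x > 0 (the value 0 at x == 0 is an assumption)."""
    return 1.0 if x > 0 else 0.0


def prior_total_protein(p, p_phos, m, factor=30.0):
    """Equation (4.3): probability density for the total protein level p,
    given the phosphorylated level p_phos and the message level m.
    The density is uniform between p_phos and factor * m and zero elsewhere."""
    upper = factor * m
    return heaviside(p - p_phos) * heaviside(upper - p) / (upper - p_phos)


# Hypothetical values: message level 2, phosphorylated level 10 (arbitrary units).
m, p_phos = 2.0, 10.0

# Midpoint-rule check that the density integrates to 1 over a range covering its support.
n, lo, hi = 14000, 0.0, 70.0
width = (hi - lo) / n
total = sum(prior_total_protein(lo + (k + 0.5) * width, p_phos, m) * width for k in range(n))
```

The check confirms that the functional form is a properly normalized density: the prior says only that P lies somewhere between the observed phosphorylated level and the assumed upper bound, with no value in that range preferred.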
There are two basic rules for the evaluation of probabilities: 


1. Product Rule: p(AB|C) = p(B|C)p(A|BC). 
2. Sum Rule: $p(A|B) + p(\bar{A}|B) = 1$.

From these two rules, it is possible to derive relations between probabilities, for
example 


\[ p(A + B \mid C) = p(A \mid C) + p(B \mid C) - p(AB \mid C). \tag{4.4} \]


Two hypotheses, A and B, are independent if knowledge of the value of B does not 
affect our knowledge of A: 


p(A|B) = p(A). (4.5) 


Two hypotheses A and B are conditionally independent given a third hypothesis C if,
once the value of C is known, knowledge of B does not further constrain A:


p(A|BC) = p(A|C). (4.6) 


Conditional independence does not imply unconditional independence. 
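
A small numerical sketch may help here. It assumes a hypothetical common cause C that makes both A and B likely when it holds and unlikely when it does not; the probabilities are invented for illustration. Given C, the two hypotheses are independent by construction, yet without knowledge of C, learning B changes the probability of A.

```python
from fractions import Fraction
from itertools import product

# Hypothetical numbers: C is true with probability 1/2; given C, A and B are
# each true with probability 9/10, and given not-C, with probability 1/10.
p_c = {0: Fraction(1, 2), 1: Fraction(1, 2)}
p_true_given_c = {0: Fraction(1, 10), 1: Fraction(9, 10)}


def bernoulli(value, p_true):
    """Probability that a binary proposition takes `value` (1 = true, 0 = false)."""
    return p_true if value == 1 else 1 - p_true


# Joint distribution p(A, B, C) built with the product rule, using
# conditional independence of A and B given C.
joint = {
    (a, b, c): p_c[c] * bernoulli(a, p_true_given_c[c]) * bernoulli(b, p_true_given_c[c])
    for a, b, c in product((0, 1), repeat=3)
}


def prob(condition):
    """Sum the joint distribution over all outcomes satisfying `condition`."""
    return sum(p for outcome, p in joint.items() if condition(*outcome))


p_a = prob(lambda a, b, c: a == 1)
p_a_given_b = prob(lambda a, b, c: a == 1 and b == 1) / prob(lambda a, b, c: b == 1)
p_a_given_c = prob(lambda a, b, c: a == 1 and c == 1) / prob(lambda a, b, c: c == 1)
p_a_given_bc = (prob(lambda a, b, c: a == 1 and b == 1 and c == 1)
                / prob(lambda a, b, c: b == 1 and c == 1))
```

Here p(A|BC) equals p(A|C), as in equation (4.6), but p(A|B) differs from p(A), so equation (4.5) fails: B is informative about A only because it is informative about the unobserved C.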

In the context of probabilistic reasoning, we always need to ask: against which 
specific alternatives are we testing a model or hypothesis? Probability theory cannot 
invent alternative hypotheses for the biologist. Given some previously established 
set of prior probabilities, p(A|X), where A is a hypothesis and X represents prior 
data, if we obtain some new data D, we can use the product rule to compute 
posterior probabilities: 


\[ p(A \mid DX) = \frac{p(AD \mid X)}{p(D \mid X)} = \frac{p(A \mid X)\, p(D \mid AX)}{p(D \mid X)}, \tag{4.7} \]


thus updating our estimation of the plausibility of our hypothesis in light of the 
new data D. This is usually referred to as Bayes’ rule. This can be written more 
symmetrically as 

\[ \frac{p(D \mid AX)}{p(D \mid X)} = \frac{p(A \mid DX)}{p(A \mid X)}, \tag{4.8} \]

provided that the denominators do not vanish. This rule expresses exactly the fact 
that the proportion by which the data D affects the probability of the hypothesis 
A is the same proportion by which the hypothesis A affects the likelihood of the 
data D. Notice that other hypotheses are implicit in the update rules since 


\[ p(D \mid X) = \sum_A p(DA \mid X) = \sum_A p(A \mid X)\, p(D \mid AX), \tag{4.9} \]

summed over all hypotheses considered. In some cases, there may be an implicit
alternative hypothesis in the problem, but in no case can one carry out probabilistic
reasoning without a comparison of alternatives.
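
The update rule can be sketched for a finite set of alternative hypotheses as follows. The hypothesis labels and likelihood values are invented for illustration, and the comparison of sequential and batch updating additionally assumes that the two data sets are conditionally independent given each hypothesis.

```python
from fractions import Fraction


def bayes_update(prior, likelihood):
    """Return posterior probabilities p(A|DX) for every hypothesis A.

    prior      -- dict mapping hypothesis -> p(A|X)
    likelihood -- dict mapping hypothesis -> p(D|AX)
    """
    # Normalization p(D|X): a sum over all hypotheses considered.
    evidence = sum(prior[h] * likelihood[h] for h in prior)
    if evidence == 0:
        raise ValueError("data are impossible under every hypothesis considered")
    # Bayes' rule applied to each hypothesis.
    return {h: prior[h] * likelihood[h] / evidence for h in prior}


# Hypothetical example with two alternative hypotheses and two data sets.
prior = {"H1": Fraction(1, 2), "H2": Fraction(1, 2)}
like_d1 = {"H1": Fraction(3, 4), "H2": Fraction(1, 4)}   # assumed p(D1|H)
like_d2 = {"H1": Fraction(1, 2), "H2": Fraction(1, 4)}   # assumed p(D2|H)

# Incremental updating: the posterior after D1 serves as the prior for D2.
step1 = bayes_update(prior, like_d1)
sequential = bayes_update(step1, like_d2)

# Batch updating with the joint likelihood of both data sets.
joint_like = {h: like_d1[h] * like_d2[h] for h in prior}
batch = bayes_update(prior, joint_like)
```

Under the stated independence assumption, the sequential and batch results coincide, which is the sense in which inferences can be updated with new data without revisiting the entire reasoning process.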

A note on terminology: We will use the terms prior and posterior fairly often, 
and it is important to emphasize here that “logical implication” is not the same as 
“biological causation”—in other words, we can infer a probability for a biologically 
earlier event from knowledge of a temporally later event. Thus, prior information 
is not necessarily about temporally prior events. For example, snow on the road in 
the morning may lead to a plausible inference that snow fell during the night, even
though the causal connection goes in the opposite direction. 

Probabilistic reasoning requires no optimization over unknown parameters. Such
optimization would be akin to eliminating hypotheses explicitly by choosing only certain specific
hypotheses based on non-probabilistic reasons and would render the logical con- 
sistency of the entire process suspect. The logically correct approach is to sum or 
integrate over unknown quantities, a process known as marginalization, so that the 
effects of unknown quantities, sometimes referred to as nuisance variables, are
averaged over all plausible values, weighted by their degree of plausibility as embodied
in their probabilities. 

What if the data comes out to have low probability with respect to the chosen 
prior distribution? This is not a disaster, nor does it imply that the reasoning 
process has broken down. Rather, it implies that the hypotheses encoded in the 
prior distribution are inadequate, and that new biology is needed to explain the 
data. 

The flexibility of the probabilistic framework is daunting, since the biologist 
is required to think about known biology in order to formulate quantitative 
hypotheses for analyzing the data. The payoff is that the biology is front-and- 
center in the whole process. Prior knowledge is the input to mathematical models of 
biological systems. It is the biologist who is responsible for the connection between 
mathematics and reality. In particular, the expectation that enough data collection 
will automatically lead to emergent realistic models is a fallacy. Modeling and data 
collection cannot be separated: it is the analysis of new data that leads to posterior 
probabilities for alternative models, and it is the plausible models that must be used 
to guide the acquisition of new data. A guide to the choice of a sufficient number 
of plausible hypotheses is the value one obtains for p(D|A). The key is to pick a 
set of hypotheses that are sufficient to explain the data, without making the set of 
hypotheses so general that the data is implausible. 





4.2 An Example 


A simple example (Skilling, 1998) should help clarify the reasoning process.
Assume that we are given a liquid, known to be water or ethanol, and a
thermometer, accurate to ±2.5 K. We need to determine the probability that the
liquid is water, given the temperature reading T on the thermometer. Let X be
the true temperature of the liquid. We start by noting the a priori
probabilities: p(water) = p(ethanol) = 0.5, given our lack of further information, and
p(X|water) = 1/100 for 273 K < X < 373 K, and p(X|ethanol) = 1/160 for
193 K < X < 353 K. The likelihood function, given the uncertainty in the
instrument, might be modeled as p(T|X, water or ethanol) = 0.2 for T between X − 2.5
and X + 2.5, and zero otherwise. In this particular case, we are assuming that the
measurement uncertainty is independent of the liquid. We first note that


$$p(\text{water}, X) = p(X \mid \text{water})\,p(\text{water}) = 0.5\,p(X \mid \text{water}), \tag{4.10}$$

so

$$p(\text{water}, X \mid T) = p(\text{water}, X)\,\frac{p(T \mid \text{water}, X)}{p(T)} = 0.5\,p(X \mid \text{water})\,\frac{p(T \mid \text{water}, X)}{p(T)}. \tag{4.11}$$
Now, suppose we measure T = 271 K. p(T) is obtained by summing over the
hypotheses, since its role is to normalize the probabilities:

$$p(T) = 0.5 \int dX \left[ p(T \mid \text{water}, X)\,p(X \mid \text{water}) + p(T \mid \text{ethanol}, X)\,p(X \mid \text{ethanol}) \right]. \tag{4.12}$$
Taking into account the values of X for which the probabilities in the integral are
non-vanishing (0.5 K of the water range and 5 K of the ethanol range lie within
2.5 K of the reading), and including the likelihood value 0.2, we find
p(T = 271 K) = 0.5 × 0.2 × (0.5/100 + 5/160) = 0.003625. It follows
that, marginalizing over X since we are interested in the classification of the liquid,
not the nuisance variable X that we needed to introduce to formulate our hypothesis
quantitatively,

$$p(\text{water} \mid T) = \int dX\, p(\text{water}, X \mid T) = \frac{0.0005}{0.003625} \approx 0.14. \tag{4.13}$$

Thus, in this example, we find that the posterior odds for water are 0.14/0.86 = 0.16, and
the posterior odds for ethanol are 0.86/0.14 = 6.14. The hypothesis
that the liquid is ethanol has much better odds than its alternative. The posterior
distribution p(liquid|T) is quite different from the prior distribution p(liquid).
This example has a set of hypotheses labelled by both a discrete variable
(water or ethanol) and a continuous variable (the true temperature X), a common
circumstance in biological inference where there are structurally different models
and continuous rates and concentrations that all need to be part of the set of
hypotheses considered. Furthermore, very often in biology we are not as interested
in the most likely values of rates and concentrations as we are in finding the probable
qualitative structure of the model, even though it isn’t possible to formulate the
model mathematically without the introduction of numerical rates. This example
also shows the importance of averaging over the so-called nuisance variables,
that is, of marginalization.
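The arithmetic of the thermometer example can be checked with a short Python sketch. It is only an illustration under the assumptions stated in the text (flat priors over the two temperature ranges and a flat likelihood of height 0.2 within ±2.5 K); the function and constant names are invented here. The factor 0.2 is kept explicitly, although it is common to both liquids and cancels in the posterior.

```python
# Priors and likelihood taken from the thermometer example in the text.
P_WATER = 0.5
P_ETHANOL = 0.5
WATER_RANGE = (273.0, 373.0)    # support of p(X|water) = 1/100
ETHANOL_RANGE = (193.0, 353.0)  # support of p(X|ethanol) = 1/160
HALF_WIDTH = 2.5                # thermometer accuracy in K
LIKELIHOOD = 1.0 / (2 * HALF_WIDTH)  # p(T|X) = 0.2 inside the window


def evidence_term(t, x_range):
    """Integrate p(T|X) p(X|liquid) over the nuisance variable X for one liquid."""
    lo, hi = x_range
    # X must lie within HALF_WIDTH of the reading and inside the prior range.
    overlap = min(hi, t + HALF_WIDTH) - max(lo, t - HALF_WIDTH)
    overlap = max(0.0, overlap)
    return LIKELIHOOD * overlap / (hi - lo)


def posterior_water(t):
    """Return (p(T), p(water|T)) for a thermometer reading t in kelvin."""
    w = P_WATER * evidence_term(t, WATER_RANGE)
    e = P_ETHANOL * evidence_term(t, ETHANOL_RANGE)
    p_t = w + e
    return p_t, w / p_t


p_t, p_water = posterior_water(271.0)
print(f"p(T) = {p_t:.6f}, p(water|T) = {p_water:.3f}")
```

Calling `posterior_water` with other readings shows how quickly the classification changes once the reading moves into a range that only one liquid can explain.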





4.3 Information Theory 


Probability and information are intimately related (Welsh, 1988; Cover and
Thomas, 1991). If we have an observed variable X, for example the
concentration of leptin, which takes the values $x_i$, $i = 1, \ldots, n$, with probabilities
$p(X = x_i) = p_i$, $\sum_i p_i = 1$, the entropy of X is defined by

$$H(X) = -\sum_i p_i \log_2 p_i. \tag{4.14}$$


The logarithm to the base 2 is a normalization convention and leads to a unit 
entropy H(X) = 1 for a variable X that takes the values 0 or 1 with equal 
probabilities. The variable X is said to require one bit of information to describe 
it. The entropy is maximized when all the probabilities $p_i$ are equal. An intuitive
way of thinking about this maximum is that in such a case, we have no reason to 
prefer any of the n alternatives over the others. In other words, we are maximally 
uncertain about the n alternatives, and the entropy measures the uncertainty in 
our knowledge. 

Any specific measurement or observation of a variable is an event. The information
of an event E with non-zero probability is defined as

$$I(E) = -\log_2 p(E). \tag{4.15}$$

If X is an observed variable, each of the values it takes has an associated information
$-\log_2 p_i$, so the mean value of the information associated with the observations of
X, $\sum_i p_i I(x_i)$, is in fact the entropy of X. This is the fundamental relation between
the entropy and information of observed variables. The intuition for this relation is
that obtaining information is the removal of uncertainty, both of
which are measured by the entropy.
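As a minimal sketch of these two definitions, the following Python functions compute the information of a single event and the entropy as the mean information. The function names and the example distributions are illustrative choices, not part of the original text.

```python
import math


def information(p):
    """Information -log2 p of an event with non-zero probability p, in bits."""
    return -math.log2(p)


def entropy(probs):
    """Entropy in bits, as the mean information; terms with p = 0 contribute nothing."""
    return sum(p * information(p) for p in probs if p > 0)


fair_coin = [0.5, 0.5]
biased = [0.9, 0.1]
uniform4 = [0.25] * 4

print(entropy(fair_coin))  # one bit
print(entropy(biased))     # less than one bit
print(entropy(uniform4))   # log2(4) = 2 bits
```

The biased distribution has lower entropy than the fair coin, in line with the statement that the entropy is largest when no alternative is preferred.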

The conditional entropy of X given an observation E is defined as

$$H(X \mid E) = -\sum_i p(X = x_i \mid E) \log_2 p(X = x_i \mid E). \tag{4.16}$$

Similarly, if Y is some other measured variable, with associated values $y_j$, $j = 1, \ldots, m$,
the conditional entropy of X given Y is

$$H(X \mid Y) = \sum_j p(Y = y_j)\, H(X \mid Y = y_j). \tag{4.17}$$


What does the conditional entropy measure? Notice that H(X|X) = 0, since
$p(X = x_i \mid X = x_j) = \delta_{ij}$. Extending this, H(X|Y) = 0 if and only if X = f(Y) for
some function f. In words, the conditional entropy vanishes if the observed value
of X is completely predicted by the observed value of Y. On the other hand, if X
and Y are independent, H(X|Y) = H(X). We also note that the joint entropy of
X and Y, H(X,Y), satisfies

$$H(X, Y) \le H(X) + H(Y). \tag{4.18}$$

In fact, it is not difficult to show that

$$H(X, Y) = H(Y) + H(X \mid Y), \tag{4.19}$$

showing that the conditional entropy exactly measures the uncertainty remaining
in our knowledge of X, given our knowledge of Y.

The relative entropy, sometimes called the Kullback-Leibler divergence, of a set of
probabilities $p_i$ for a measurement X and another set of probabilities $q_i$ for the same
measurement (for example, these could be the prior and posterior probabilities) is
defined by

$$D(p \,\|\, q) = \sum_i p_i \log_2 (p_i / q_i). \tag{4.20}$$

D is always non-negative and only vanishes if $p_i = q_i$ for all i. In terms of
information theory, the information about X contained in Y is

$$I(X \mid Y) = H(X) - H(X \mid Y) = I(Y \mid X) = D(p(X, Y) \,\|\, p(X)p(Y)), \tag{4.21}$$


which is symmetric in X and Y. This information is often called the mutual 
information of X and Y. If X and Y are independent, the mutual information 
vanishes. If the value of X is predicted by the value of Y, H(X|Y) = 0, and the 
mutual information is just the information in X. Mutual information is often used 
as a similarity measure in expression array (Butte et al., 2000) clustering of genes, 
but it is not a “distance measure” in the sense that it does not satisfy the triangle 
inequality 


$$d(x, y) + d(y, z) \ge d(x, z), \tag{4.22}$$

which holds for any three points x, y, z in Euclidean space. This inequality expresses
the intuition that the length of any side of a triangle is no greater than the sum of the
lengths of the other two sides. However,

$$m(X, Y) = H(X \mid Y) + H(Y \mid X) \tag{4.23}$$

is symmetric in X and Y and does satisfy the triangle inequality:
$m(X, Y) + m(Y, Z) \ge m(X, Z)$. If X and Y are independent, m(X,Y) = H(X) + H(Y),
and if X = f(Y) and $Y = f^{-1}(X)$, then m(X,Y) = 0.
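The relations among these quantities can be verified numerically. The sketch below assumes a joint distribution supplied as a nested list `joint[i][j]`; this representation and the function names are illustrative. It computes the conditional entropy from eq. (4.19), the mutual information from eq. (4.21), the relative entropy form of the mutual information, and the distance m(X,Y) of eq. (4.23).

```python
import math


def entropy(probs):
    """Entropy in bits of a list of probabilities."""
    return sum(-p * math.log2(p) for p in probs if p > 0)


def summaries(joint):
    """joint[i][j] = p(X = x_i, Y = y_j); returns the quantities in the text."""
    px = [sum(row) for row in joint]
    py = [sum(col) for col in zip(*joint)]
    h_xy = entropy([p for row in joint for p in row])
    h_x, h_y = entropy(px), entropy(py)
    h_x_given_y = h_xy - h_y           # eq. (4.19) rearranged
    h_y_given_x = h_xy - h_x
    # Relative entropy D(p(X,Y) || p(X)p(Y)), eq. (4.20) applied to the joint.
    kl = sum(
        p * math.log2(p / (px[i] * py[j]))
        for i, row in enumerate(joint)
        for j, p in enumerate(row)
        if p > 0
    )
    return {
        "H(X)": h_x,
        "H(Y)": h_y,
        "H(X|Y)": h_x_given_y,
        "I(X|Y)": h_x - h_x_given_y,          # eq. (4.21)
        "D": kl,
        "m(X,Y)": h_x_given_y + h_y_given_x,  # eq. (4.23)
    }


independent = [[0.25, 0.25], [0.25, 0.25]]
copied = [[0.5, 0.0], [0.0, 0.5]]  # Y always equals X
print(summaries(independent))
print(summaries(copied))
```

For the independent pair the mutual information vanishes and m(X,Y) = H(X) + H(Y); for the copied pair the mutual information equals H(X) and m(X,Y) vanishes, as stated above.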

In even a compressed account of information theory it is necessary to mention
the connection of entropy with coding theory. Briefly, if we think of compressing
our measurements $D = \{d_i, i = 1, \ldots, n\}$ into words made from an alphabet of
N symbols, the average length of the words will be at least $H(D)/\log_2 N$. This
result makes it possible to estimate the entropy H(D) by finding an encoding
of the data, in situations where, due to a lack of data or prior knowledge, we
are unable to compute the entropy directly. If we use an encoding in terms of
an alphabet consisting of 0 and 1, the average length of the words will be an
upper bound on the entropy. Suppose each $d_i$ is the result of an expression array
measurement in a particular condition. We discretize $d_i$ into bins defined by our
expected uncertainty in the measurements. (The choice of binning can also be
incorporated into our description of the information, but we do not do so here for
the sake of simplicity.) A partitioning of the G genes into $N\ (\le G)$ subsets with
$n_a > 0$ $(a = 1, \ldots, N)$ elements has a probability
$p(N, G) = N!\,(G - N + 1)!\,N^{(N - G)}/G!$,
so its information is $H_{\text{cluster}}(N, G) = -\log_2 p(N, G)$. (Prior information can be used
in this step to reduce this information by taking into account information from the
literature on known interactions between genes; the effect of this is to reduce the
total number of independent genes G to some smaller number, using prior biological
information to place certain genes together in a cluster.) In terms of these putative
clusters, we only need N numbers to describe $d_i$ instead of the original G. However,
we now have to contend with the inaccuracy of our compression as well, in other
words, with the information in the residuals
$\epsilon_{i,\text{gene}} = d_{i,\text{gene}} - d_{i,\text{cluster}}$. The original
information is


$$H_{\text{orig}} = \sum_{\text{genes}} H_{\text{gene}}(d), \tag{4.24}$$


where we compute $H_{\text{gene}}(d)$ by considering how we could encode the data. If
the range of values that the gene takes over the n experiments is m, we need
about $n \log_2 m$ bits to encode the values. We also need to encode the information
specifying the range of values for each gene, or use the same range for all the genes,
and avoid specifying the range for every gene. Taking into account the information
required to specify the clustering, the new information is

$$H_{\text{final}}(N) = H_{\text{cluster}}(N, G) + \sum_{\text{clusters}} H_{\text{cluster}} + \sum_{\text{genes} - \text{clusters}} H_{\text{gene}}(\epsilon), \tag{4.25}$$


where we also restrict the sum in the residual information to be over the genes 
that are not cluster centers (where we define the cluster center in a variety of ways, 
for example as the gene that exhibits the least deviation from the median of the 
cluster over all the n experiments). The cluster center will, by definition, show no 
deviation from the value accorded to the cluster. The term $H_{\text{cluster}}$ is the term
that favors model simplicity. At one extreme, there is a unique clustering of one
cluster of G genes, which amounts to the original data expressed as residuals, and
at the other extreme G clusters of one gene each, in which the residuals vanish. At
both these extremes, $H_{\text{cluster}}(N = 1, G) = H_{\text{cluster}}(N = G, G) = 0$. We expect a
good clustering to reduce the amount of information in the residuals, because many 
of the entries in the residuals should vanish, and this should be balanced by the 
amount of information required to specify the clustering. For example, if the residual 
matrix is a sparse matrix, the coding required to specify it is just the gene name 
or index, the experiment index i, and the true value for every non-zero entry. This 
encoding will obviously take a lot less information to describe than the entire matrix, 
provided that the clustering is an accurate description of the correlations between 
gene expression values. Evaluating $H_{\text{final}}(N)$ for different values of N (minimizing
over different choices of $\{n_a\}$ for each N) gives us a criterion for picking the number
of clusters. We can also use this approach to cluster the experiments. 
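The clustering term can be tabulated directly. The sketch below is hedged: it assumes that the partition probability reads p(N, G) = N! (G − N + 1)! N^(N − G) / G!, a reading chosen because it makes the information vanish at both extremes N = 1 and N = G, as the text requires. The function name is illustrative, and the residual terms of the full criterion are not modeled here.

```python
import math


def h_cluster(n_clusters, n_genes):
    """-log2 p(N, G) with p(N, G) = N! (G-N+1)! N**(N-G) / G! (assumed reading)."""
    log_p = (
        math.lgamma(n_clusters + 1)               # log N!
        + math.lgamma(n_genes - n_clusters + 2)   # log (G - N + 1)!
        + (n_clusters - n_genes) * math.log(n_clusters)
        - math.lgamma(n_genes + 1)                # log G!
    )
    return -log_p / math.log(2)


G = 20
for N in (1, 2, 5, 10, 19, 20):
    print(N, round(h_cluster(N, G), 2))
```

The cost of specifying the partition is zero at the two extremes and largest for intermediate numbers of clusters, which is the penalty that the residual information has to outweigh.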
4.4 Another Example: Probabilities Are Not Frequencies


Suppose we have expression data from several samples, normalized to message 
counts per cell. This is not the normalization commonly used for expression data, 
but the example will show that this normalization is helpful for certain consid- 
erations. The problem is to figure out the message counts in each cell, given the 
expression measurements. Abstractly posed, the problem is that there are a variety 
of colored balls in different jars, each jar corresponding to a sample. We have taken 
a handful of balls from each jar, corresponding to the expression data. We want to 
find the probable contents of each jar (Jaynes, 2003). 

Let’s focus on just one color of ball, red. We have drawn n balls out of a jar, and 
r of them have been red. What is the probability that we would draw r red balls 
in n tries if the total number of balls in the jar is N and the number of red balls in 
the jar is R? Since we are drawing the balls without replacement, this probability 
is easily computed. The probability of the first ball drawn being red is 


$$p(r = 1 \mid N, R, n = 1) = \frac{R}{N}, \quad \text{with} \quad p(r = 0 \mid N, R, n = 1) = \frac{N - R}{N}. \tag{4.26}$$





The probability of both balls being red when two balls are drawn is

$$p(r = 2 \mid N, R, n = 2) = p(r = 1 \mid N, R, n = 1) \times p(r = 1 \mid N - 1, R - 1, n = 1), \tag{4.27}$$

and so on. Similarly, the probability of exactly one of the two balls being red,
p(r = 1|N, R, n = 2), is

$$p(r = 1 \mid N, R, n = 1) \times p(r = 0 \mid N - 1, R - 1, n = 1) \tag{4.28}$$
$$+\; p(r = 0 \mid N, R, n = 1) \times p(r = 1 \mid N - 1, R, n = 1) \tag{4.29}$$
$$= \frac{2R(N - R)}{N(N - 1)}. \tag{4.30}$$

A little further calculation shows that

$$p(r \mid N, R, n) = \binom{N}{n}^{-1} \binom{R}{r} \binom{N - R}{n - r} \tag{4.31}$$

in general.
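The general formula (4.31) is the hypergeometric distribution, and the special cases above can be checked against it with exact rational arithmetic. This is a small sketch; the function name and argument order are illustrative.

```python
from fractions import Fraction
from math import comb


def p_red(r, n, R, N):
    """p(r | N, R, n): r red among n balls drawn without replacement."""
    if not (0 <= r <= n <= N and r <= R and n - r <= N - R):
        return Fraction(0)  # the draw is impossible for this jar
    return Fraction(comb(R, r) * comb(N - R, n - r), comb(N, n))


N, R = 10, 3
print(p_red(1, 1, R, N))  # first ball red: R/N
print(p_red(1, 2, R, N))  # exactly one red in two draws
print(Fraction(2 * R * (N - R), N * (N - 1)))  # eq. (4.30) for comparison
```

The last two printed values agree, which is the content of eqs. (4.28) to (4.30).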


Now, N and R are unknown. Having drawn r red balls out of n balls, we know
of course that N ≥ n and R ≥ r. According to the rules of probability theory,

$$p(N, R \mid n, r) = p(N)\,p(R \mid N)\,\frac{p(n, r \mid N, R)}{p(n, r)}, \tag{4.32}$$

since p(N, R) = p(N)p(R|N). What is p(n, r)? It is a normalization constant given
by

$$p(n, r) = \sum_{N=0}^{\infty} \sum_{R=0}^{N} p(N)\,p(R \mid N)\,p(n, r \mid N, R), \tag{4.33}$$

where p(n, r|N, R) obviously vanishes if N < n or R < r or N − R < n − r.
Where is the biology in all this? It is in the probabilities we assign to p(N) and 
p(R|N) based on our biological knowledge. For example, the samples may be taken
from cancerous or normal tissue. In this case, we might expect that cancerous
cells are proliferating rapidly and may have an overexpression of genes involved 
in cell division, compared to normal samples. However, this proliferation has a 
metabolic cost as well, so these cells may also have an overexpression of messages 
corresponding to, for example, glucose transport. There may, on the other hand, 
be genes that are expressed at the same level as in normal tissue. Thus, if R 
corresponded to any of these classes of messages, we would have different prior 
distributions for p(N, R). We might even choose to use the product rule differently, 
and factor p(N, R) = p(R)p(N |R), if, for example, we knew the approximate rate of 
proliferation of the cells, and R was a message encoding a mitotic spindle protein.
The central point is that known biology dictates the choice of prior distributions. 
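To make the role of the priors concrete, the posterior p(N, R|n, r) of eq. (4.32) can be evaluated on a finite grid. The sketch below rests on assumptions that are not in the text: the jar size is cut off at a hypothetical `N_MAX` so that the sum (4.33) is finite, and both priors are taken to be flat purely as placeholders. Any biologically motivated prior can be passed in instead.

```python
from math import comb

N_MAX = 60  # hypothetical cut-off on the jar size, needed to normalize


def likelihood(n, r, N, R):
    """Hypergeometric p(n, r | N, R); zero when the draw is impossible."""
    if N < n or R < r or N - R < n - r:
        return 0.0
    return comb(R, r) * comb(N - R, n - r) / comb(N, n)


def posterior(n, r, prior_N, prior_R_given_N):
    """Return {(N, R): p(N, R | n, r)} on the grid 0 <= R <= N <= N_MAX."""
    weights = {}
    for N in range(N_MAX + 1):
        for R in range(N + 1):
            w = prior_N(N) * prior_R_given_N(R, N) * likelihood(n, r, N, R)
            if w > 0:
                weights[(N, R)] = w
    p_nr = sum(weights.values())  # the normalization constant p(n, r)
    return {key: w / p_nr for key, w in weights.items()}


# Deliberately uninformative choices; biological knowledge would replace them.
flat_N = lambda N: 1.0 / (N_MAX + 1)
flat_R = lambda R, N: 1.0 / (N + 1)

post = posterior(n=10, r=3, prior_N=flat_N, prior_R_given_N=flat_R)
mean_fraction = sum(p * R / N for (N, R), p in post.items())
print(f"posterior mean of R/N: {mean_fraction:.3f}")
```

Replacing `flat_R` with a prior concentrated on large R, for instance for a message expected to be overexpressed, shifts the posterior accordingly; this is where the biology enters.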

We can also employ this logic in an exploratory mode, assuming that we do not
know what form p(N, R) should take, and compute p(N, R|D) conditioned on all the
data D we have available. Label the different samples with an index a = 1, 2, ....
Suppose that the red balls correspond to a particular message, and we wish to
ascertain if R scales as $N^{\alpha}$ for some non-zero power $\alpha$ or if R is independent of
N, based on our samples. For each a, we compute $p(N_a, R_a \mid n_a, r_a)$ as described
above, using some prior distribution $p_0(N, R)$. Since we do not
know what to expect, other than the fact that R ≤ N, we should choose $p_0(N, R)$
to reflect this ignorance, which is sometimes referred to as choosing a maximally
uninformative prior. If we do have some biological knowledge to guide our choice, we
need to incorporate it into the prior. There is no point in using uninformative priors
when information is available. To this end, we can iterate through the samples,
computing successively


$$p(N, R \mid n_2, r_2, n_1, r_1) = p(N, R \mid n_1, r_1)\,\frac{p(n_2, r_2 \mid N, R, n_1, r_1)}{p(n_2, r_2 \mid n_1, r_1)}, \tag{4.34}$$





and so on, until we finally arrive at p(N, R|D) where D stands for the entire data set
$\{n_1, r_1, \ldots\}$. We can now compute H(R|N) = H(R, N) − H(N) from p(N, R|D),
and answer our question: If H(R|N) ≈ 0 then R is a function of N, conditioned on
the given data D.

It should be noted here that there was no assumption in our considerations that
the different samples were repetitions of some experiment. The probabilities that
we calculate are not some measure of frequency of occurrence in some idealized set
of infinite numbers of trial experiments. In general, there is no logical consistency
to assuming that probabilities are frequencies. Probabilities are nothing more or
less than quantitative expressions of our state of knowledge. For experiments where
the results are exchangeable sequences (for example, the identical experiment is
performed n times), the expectation of the frequency of a particular result is
numerically equal to the probability: $E(f_i) = p_i$. So the probability is an estimate
of the frequency, but to understand the uncertainty in this estimate, we need to
compute the covariance of $f_i$ and $f_j$, which leads to

$$E(f_i f_j) - E(f_i)E(f_j) = (p_{ij} - p_i p_j) + \frac{1}{n}(\delta_{ij} p_i - p_{ij}). \tag{4.35}$$

Here $p_{ij}$ is the joint probability of outcomes i and j at two different repetitions of
the experiment. It is clear then that there is a finite n correction, and a non-zero
$p_{ij} - p_i p_j$ correction to the probability assessment of the frequency which does not
vanish even for infinite n. For the small numbers of repetitions available due to
resource constraints in expression array measurements, for example, it is important
to keep the finite-size correction in mind. In the particular case i = j and
$p_{ii} = p_i^2$ we get

$$E(f_i^2) - E(f_i)^2 = \frac{1}{n}\,p_i(1 - p_i). \tag{4.36}$$


If we are, conversely, attempting to assess probabilities by studying observed 
frequencies, these relations are again relevant. 
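The finite-n spread of an observed frequency can be checked exactly for the simplest case of independent repetitions, where the count of a given outcome is binomially distributed. The sketch below makes that independence assumption explicitly (it corresponds to the special case in which the joint probability of the outcome at two different repetitions is the square of its probability); the function name is illustrative.

```python
from math import comb


def frequency_moments(n, p):
    """Exact mean and variance of f = k/n for n independent trials."""
    pmf = [comb(n, k) * p**k * (1 - p) ** (n - k) for k in range(n + 1)]
    mean = sum(w * k / n for k, w in enumerate(pmf))
    second = sum(w * (k / n) ** 2 for k, w in enumerate(pmf))
    return mean, second - mean**2


for n in (3, 10, 100):
    mean, var = frequency_moments(n, p=0.2)
    # Columns: n, E(f), Var(f), and p(1 - p)/n for comparison.
    print(n, round(mean, 6), round(var, 6), round(0.2 * 0.8 / n, 6))
```

With only a handful of repetitions the standard deviation of the frequency is comparable to the probability itself, which is why the finite-size correction matters for small replicate numbers.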





4.5 Search Theory, or “Use the Information, Stupid” 


We have a search space of possible models. We have finite resources available to 
find a correct model in the search space. How should this search be conducted
(Jaynes, 1985)? Suppose we divide the space to be searched into n subsets, with
search parameters $m_i$, $i = 1, \ldots, n$ for each subset. The search parameters measure
the fractional “size” of the subset in terms of search difficulty, and satisfy $\sum_i m_i = 1$.
For example, the $m_i$ parameters could be the fractional volumes of the subsets:
Larger subsets would take longer to search and therefore would have larger $m_i$
values. We also have probabilities $p_i$ assigned to each subset, also adding up to
unity. These probabilities are our assessments of the presence of a correct model
in a given subset. If a correct model is present in subset i, the probability that a
search effort z will lead to finding it is

$$p_i(\text{discovery} \mid z) = 1 - \exp(-z/m_i). \tag{4.37}$$


Search effort, which might be computer time or expression level measurements, is
limited to be C. If we start with some prior probabilities $p_i^{(0)}$, and expend $z_i$ of our
search effort on subset i, the posterior probabilities will be

$$p_i^{(1)} = \frac{p_i^{(0)} \exp(-z_i/m_i)}{\sum_j p_j^{(0)} \exp(-z_j/m_j)} = \frac{p_i^{(0)} \exp(-z_i/m_i)}{1 - p_D} \tag{4.38}$$

if a correct model is not located, where $p_D$ is the probability of finding a correct
model with the given search efforts $z_i$. Intuitively, this shows that if we search a
subset and do not find a model fitting the data in that subset, then the probability
for the model being in that subset decreases. In this situation, how should we expend
our search effort?

The information we possess about the subsets is measured in two relative
entropies, which measure the sizes of the subsets versus the probabilities that a correct
model is present in a subset. We define

$$I = \sum_{i=1}^{n} p_i \log(p_i/m_i), \tag{4.39}$$

and

$$J = \sum_{i=1}^{n} m_i \log(m_i/p_i). \tag{4.40}$$

Clearly, I ≥ 0 and J ≥ 0, with I = J = 0 only when $p_i = m_i$ for all i = 1, ..., n.
As we search, we are expending our search effort z continuously, up to a maximum
value C, so we can think of $p_i$, I, and J as functions of z, starting from initial values
$p_i(0)$, I(0), J(0). Why are I and J relevant for search? We calculate

$$J(z) = J(0) + \log(1 - p_D) + z, \qquad z = \sum_i z_i, \tag{4.41}$$

which can be rewritten as

$$p_D = 1 - \exp(-(z + \Delta)), \qquad \Delta = J(0) - J(z). \tag{4.42}$$


Thus, the detection probability $p_D$ decreases if J(z) increases in the course of the
search. The best we can do is to make J(z) decrease, but since J ≥ 0, the optimal
strategy is to reach J = 0 and to conduct further search so as to maintain J(z) = 0.
In other words, we need to use up all the information we possess about the subsets 
by allocating search efforts z; among the subsets to reach the J = 0 state. 
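The posterior update after an unsuccessful search and the identity (4.41) relating J to the detection probability can be checked numerically. The numbers below are invented for illustration, as are the function names; natural logarithms are used, consistent with the exponential in eq. (4.37).

```python
import math


def search_update(p0, m, z):
    """Posterior p_i after unsuccessful efforts z_i, and the detection probability."""
    miss = [p * math.exp(-zi / mi) for p, mi, zi in zip(p0, m, z)]
    p_detect = 1.0 - sum(miss)
    return [w / (1.0 - p_detect) for w in miss], p_detect


def J(p, m):
    """Relative entropy of the sizes m with respect to the probabilities p."""
    return sum(mi * math.log(mi / pi) for pi, mi in zip(p, m))


p0 = [0.6, 0.3, 0.1]   # prior probabilities that each subset holds a correct model
m = [0.2, 0.3, 0.5]    # fractional search difficulty of each subset
z = [0.15, 0.05, 0.0]  # effort spent so far on each subset

p1, p_detect = search_update(p0, m, z)
lhs = J(p1, m)
rhs = J(p0, m) + math.log(1.0 - p_detect) + sum(z)
print(p1, p_detect)
print(lhs, rhs)  # the two sides of eq. (4.41)
```

In this example the heavily searched first subset loses probability while the unsearched third subset gains it, and the two sides of eq. (4.41) agree to rounding error.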

Let us suppose that we want our expenditure of search effort to be optimal at 
all steps. We may not know how much computer time we will have available before 
an abstract needs to be submitted, for example. How should we allocate our next 
infinitesimal bit of search effort, $\delta z$? Notice that $(p_i/m_i)_{\max} \ge 1$, since
$\sum_i p_i = \sum_i m_i = 1$, and $\delta J = (1 - p_j/m_j)\,\delta z$ if the search effort is expended in subset j. It
therefore follows that we should search the subset with the highest value of $p_j/m_j$
at any given step. We order the subsets so that $(p_1/m_1) \ge (p_2/m_2) \ge (p_3/m_3) \ldots$.
We search subset 1 until $p_2/m_2 = p_1/m_1$ (note that all $p_i$ are functions of the
search effort expended, z). We then treat subsets 1 and 2 as one large subset and
search it keeping $p_2/m_2 = p_1/m_1$ until $p_1/m_1 = p_2/m_2 = p_3/m_3$ at which point
we treat the subsets 1, 2, and 3 as one big subset and proceed as before. This part
of the search process continually decreases I and J and increases $p_D$ until all the
ratios $p_i/m_i$ are equal. This is the state of maximum uncertainty characterized by
I = J = 0. Having used up all our information, the best the rest of the search can
do is to maintain this state until all the search effort available has been expended.
An interesting point about this strategy is that it can be stopped at any given step,
and we can be certain that we have done the best that we could have, given the
information that we had available. Such finite-resource optimal strategies are likely 
to be very important in large-scale biological inference, given computational and 
experimental limitations. 
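The allocation rule described above can be sketched as a discretized greedy loop: at each small step, spend the effort on the subset with the largest ratio of probability to size, then update the probabilities on the assumption that nothing was found. The step size `dz`, the example numbers, and the function name are assumptions made for this illustration; the continuous strategy in the text is the limit of vanishing step size.

```python
import math


def greedy_search(p0, m, budget, dz=1e-3):
    """Spend `budget` in steps of dz on the subset with the largest p_i/m_i."""
    p = list(p0)
    z = [0.0] * len(p)
    survive = 1.0  # probability that nothing has been found so far
    for _ in range(int(round(budget / dz))):
        j = max(range(len(p)), key=lambda i: p[i] / m[i])
        miss_j = math.exp(-dz / m[j])
        step_survive = 1.0 - p[j] * (1.0 - miss_j)
        # Condition on not finding a correct model during this step.
        p = [(pi * miss_j if i == j else pi) / step_survive for i, pi in enumerate(p)]
        survive *= step_survive
        z[j] += dz
    return z, p, 1.0 - survive


p0 = [0.6, 0.3, 0.1]
m = [0.2, 0.3, 0.5]
z, p, p_detect = greedy_search(p0, m, budget=0.5)
print("effort per subset:", [round(v, 3) for v in z])
print("detection probability:", round(p_detect, 4))
# Spreading the same budget in proportion to subset size gives 1 - exp(-budget).
print("size-proportional search:", round(1.0 - math.exp(-0.5), 4))
```

For these numbers the greedy rule concentrates all effort on the first two subsets and reaches a noticeably higher detection probability than a search that ignores the prior probabilities.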





4.6 Computational Techniques 


While systems biology is generally associated with large-scale data collection, when 
it comes to inference of biological processes as a complex system, the scale of 
data collection is meager, and computational resources to analyze the data are 
limited. If our space of hypotheses has more than a few components, the entire 
set of probabilities p(X|D) cannot be exhaustively computed, since there is a 
combinatorial explosion in the computational cost. The computational problems 
can be overcome with variants of Markov chain Monte Carlo (MCMC) methods 
(Skilling, 1998; D’Agostini, 2003). In general, Markov processes are processes where
the next step only depends on the present location, not on the previous history of 
the process. A Monte Carlo method refers to a stochastic method for evaluating 
a quantity, for example estimating the value of an integral. An MCMC method 
marries the two, using a stochastic Markov process to generate new data points for 
the Monte Carlo estimation of the quantities of interest. 

One of the key points is to consider biologically interesting questions. For exam- 
ple: we may want to know the probability that a certain hypothesis x is supported 
by the data D. In other words, of all the models that we can generate to fit the data 
with our full set of hypotheses X, we want to ask how likely it is that x is used in
the models that fit the data. We construct a function on the space of models, $I_x(Y)$,
which takes the value 1 if x is used in the model Y and 0 if x is not. An example
of x might be (a quantitative version of) “IKK activates NF-κB translocation to
the nucleus.” We can now evaluate the expectation E of $I_x(Y)$ over the space of
models by computing

$$E(I_x(Y)) = \sum_Y p(Y \mid D)\, I_x(Y). \tag{4.43}$$


This is, typically, a huge (or infinite) summation, and impossible to compute exactly. 
The trick is to approximate this summation using MCMC methods. 

The Metropolis algorithm is a particular implementation of MCMC
computations: We start from some initial model $Y_0$, and compute $p(Y_0 \mid D)$ and $I_x(Y_0)$. We
modify the model by changing the hypotheses incorporated or by changing the rate
constants or kinetic parameters, generating a new model $Y_1$, for which we also
compute $p(Y_1 \mid D)$. (Since p(Y, D) = p(D)p(Y|D), we are free to neglect the constant
factor p(D).) If $p(Y_1 \mid D) \ge p(Y_0 \mid D)$, we accept the change and add $I_x(Y_1)$ to our
previous computation of $I_x(Y_0)$. However, if $p(Y_1 \mid D) < p(Y_0 \mid D)$ we compute a
random number r between 0 and 1 and accept $Y_1$ only if $p(Y_1 \mid D)/p(Y_0 \mid D) > r$. If we do
not accept $Y_1$, we accept $Y_0$ again and try a new change in the model, and repeat
the process. After n accepted models, we compute an approximation to $E(I_x(Y))$:

$$E(I_x(Y)) \approx \frac{1}{n} \sum_{i=1}^{n} I_x(Y_i). \tag{4.44}$$

The convergence of this approximation to the exact value scales as $n^{-1/2}$, which is
slow but not impossible to compute. In the event that the probability of generating
$Y_1$ from $Y_0$ is not symmetric, in other words, the probability of generating $Y_1$
from $Y_0$ is different from that for generating $Y_0$ from $Y_1$, a Hastings ratio
$p(Y_1 \to Y_0)/p(Y_0 \to Y_1)$ multiplies the ratio $p(Y_1 \mid D)/p(Y_0 \mid D)$. This factor is significant in
Markov chains on spaces of models.
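A minimal sketch of the Metropolis estimate of E(I_x(Y)) follows, on a toy model space small enough that the exact sum can be enumerated for comparison. Everything specific here is hypothetical: a model is a tuple of K inclusion flags, the `weight` function stands in for the unnormalized p(Y, D), and proposals flip a single flag so that they are symmetric and no Hastings ratio is needed.

```python
import itertools
import math
import random

K = 4  # number of candidate hypotheses; a model Y is a tuple of K inclusion flags


def weight(y):
    """Hypothetical unnormalized p(Y, D); any positive function would do here."""
    score = 1.5 * y[0] - 1.0 * y[1] + 0.5 * y[2] * y[3] - 0.3 * sum(y)
    return math.exp(score)


def exact_expectation(x):
    """E(I_x(Y)) by brute-force enumeration of all 2**K models."""
    models = list(itertools.product((0, 1), repeat=K))
    total = sum(weight(y) for y in models)
    return sum(weight(y) for y in models if y[x]) / total


def metropolis_expectation(x, n_steps, seed=0):
    """Estimate E(I_x(Y)) with single-flip (symmetric) Metropolis proposals."""
    rng = random.Random(seed)
    y = tuple(rng.randint(0, 1) for _ in range(K))
    hits = 0
    for _ in range(n_steps):
        k = rng.randrange(K)
        proposal = y[:k] + (1 - y[k],) + y[k + 1:]
        # Uphill moves are always accepted; downhill moves with probability = ratio.
        if weight(proposal) / weight(y) > rng.random():
            y = proposal
        hits += y[x]  # a rejected proposal counts the current model again
    return hits / n_steps


print(exact_expectation(0), metropolis_expectation(0, n_steps=200_000))
```

Note that only ratios of weights enter the acceptance test, so the normalization p(D) is never needed, and that a rejected proposal re-counts the current model in the running average.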

Another approach to MCMC computations is to use genetic algorithms. The 
interesting point about these algorithms for the purposes of model selection is 
that they are better suited to multimodal problems and problems with discrete 
variables (common in testing collections of hypotheses, for example). The main 
point is to start with a family of P sample models and generate new models 
by two means: mutations (changes in a single model) and crossovers (exchanging 
hypotheses between two distinct models). Other means of generating new models 
can be used as well, as long as the method is reversible. For example, changing one 
of the models in the population based on the difference between two other models in 
the population is a possible way to exploit the population as a whole, and not just 
one or a pair of the models in the population. It is particularly useful for biological 
applications if the allowed mutation and crossover transformations are actually 
biologically feasible alternative mechanisms for implementing a given biological 
function. Having implemented the genetic algorithm, it remains to explain how this 
fits in with the expectation computation of interest. For this we just have to go back 
to the Metropolis algorithm, described above, and think of each genetic algorithm 
step as a step on the P-fold product of our space of models. We now apply the same 
acceptance or rejection criterion to each step, except that the probabilities that we 
use are computed as the product $\prod_a p(Y_a \mid D)$. The expectation is computed by
picking a random selection $Y_i$ out of the P models in the population at each step,
and again using

$$E(I_x(Y)) \approx \frac{1}{n} \sum_{i=1}^{n} I_x(Y_i). \tag{4.45}$$


Asymptotically, this will again converge to the true expectation. 

In this way, by computing expectations, we can assign plausibilities to our 
set of hypothetical interactions X. A note of caution: Taking the most plausible
hypotheses in X and putting them together in a model does not necessarily result
in a model that has a high probability p(Y|D) since there may be correlations
or anticorrelations between hypotheses. In other words, there may be alternative
explanations for a phenomenon which may be antagonistic. To figure out which
hypothetical interactions combine well to match the data, we could, for example,
compute the pair-wise expectations $E(I_x(Y) I_z(Y))$ where x and z are hypotheses in
X. If we pick a threshold t for the mutual information, these pair-wise expectations
lead to a graph where the hypotheses are vertices and the edges are links between
hypotheses with mutual information $I(x \mid z) = H(x) + H(z) - H(x, z) > t$ (using
$H(x) = -E(I_x(Y)) \log_2 E(I_x(Y)) - (1 - E(I_x(Y))) \log_2 (1 - E(I_x(Y)))$ and similarly
for $I_z(Y)$ and $I_x(Y) I_z(Y)$, up to a finite sample size correction). The connected
components of this graph (or a subset of them) would be an interesting starting
point for more detailed model inference.

An important technical point to speed up the computations is to use simulated 
annealing. In this numerical method, based on analogies with statistical mechanics, 
we replace p(Y, D) = p(Y)p(D|Y) with $p(Y)\,p(D \mid Y)^{\lambda}$ where $0 < \lambda \le 1$. $1/\lambda$ plays
the role of temperature, so when we approach $\lambda = 1$ from higher values of the
temperature (lower values of $\lambda$), gradually the peaks and valleys in log p(D|Y)
get more pronounced, and therefore make it less likely that a step that would
decrease the probability is accepted. Thus, at small values of $\lambda$ it is easier for the
MCMC update algorithm to find acceptable steps resulting in a wider coverage of
the population of models. As $\lambda$ is brought closer to the true value 1, the MCMC
update steps will stay in the vicinity of the optimal model found at lower values of
$\lambda$. It is the process of reaching the optimal model that is shortened by the
cooling-down phase of simulated annealing. Care must be taken in the multi-modal case to
find the expectations around each locally modal value. This is usually accomplished 
by repeating the calculation with different starting points. 
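The effect of the exponent on the acceptance step can be seen in a few lines. This sketch assumes a symmetric proposal, so that the Metropolis acceptance probability is the smaller of 1 and the ratio of tempered targets; the function name and the numerical values are illustrative.

```python
def tempered_acceptance(prior_ratio, likelihood_ratio, lam):
    """Metropolis acceptance probability for the target p(Y) p(D|Y)**lam."""
    return min(1.0, prior_ratio * likelihood_ratio ** lam)


# A proposed model that fits the data 1000 times worse than the current one,
# with equal prior probability for both models.
for lam in (0.0, 0.1, 0.5, 1.0):
    print(lam, tempered_acceptance(1.0, 1e-3, lam))
```

At small values of the exponent the unfavorable step is accepted often, which lets the chain roam widely; at the true value 1 it is accepted only about once in a thousand attempts.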





4.7 Three Applications 


There is a wide range of applications of probabilistic inference in systems biology, 
indicative of the universality and flexibility of the methodology expounded in this 
chapter. In this section, we review briefly three examples from the literature: 


1. The use of multiple types of experimental data to organize genes in modules 
(Lee et al., 2004) 


2. Model selection on the basis of a Bayesian comparison of models (Sachs et al., 
2005) 


3. Studying the sensitivity and specificity of Bayesian inference of genetic regula- 
tory interactions (Husmeier, 2003) 


A specific application of probabilistic techniques that is used in two of these 
examples is the concept of a Bayesian network: If we have a set of measured 
quantities, and the probabilities of the values observed of some of the quantities 
are conditional on the observed values of some of the other quantities, we can draw 
a graph of dependencies, with the measured quantities represented as nodes and 
directed arrows going into every measured quantity from the measured quantities 
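upon which the probabilities of its values are conditional. An example of a Bayesian
network is

[Diagram (4.46): a directed graph on the nodes A, B, C, and D, arranged as a diamond with A at the top and D at the bottom, containing no directed cycle]

while

[Diagram (4.47): the same four nodes, with the arrows forming a directed cycle]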
D 


is not a Bayesian network. The parents of a node are the tail ends of the arrows 
pointing to that node. If the graph obtained in this manner has no directed cycles 
(in other words, no closed loops with arrows all connected head to tail), then it is 
called a Bayesian network. This special case is computationally much more tractable 
than the general case (usually referred to as a graphical model). The calculational 
tractability arises from the fact that variables can be easily marginalized in Bayesian 
networks, since the probability distribution factorizes: 
$$p(X_1, X_2, \ldots, X_n) = \prod_{i=1}^{n} p(X_i \mid \text{parents of } X_i). \tag{4.48}$$
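The factorization can be illustrated on a four-node network. The sketch below assumes a diamond-shaped graph with arrows A → B, A → C, B → D, and C → D, binary variables, and made-up conditional probability tables; none of these numbers come from the text.

```python
import itertools

# Hypothetical conditional probability tables for binary variables.
p_a = {1: 0.3, 0: 0.7}
p_b_given_a = {0: 0.2, 1: 0.9}  # p(B = 1 | A)
p_c_given_a = {0: 0.5, 1: 0.1}  # p(C = 1 | A)
p_d_given_bc = {(0, 0): 0.05, (0, 1): 0.6, (1, 0): 0.7, (1, 1): 0.95}  # p(D = 1 | B, C)


def bern(p_one, value):
    """Probability of a binary value given the probability of the value 1."""
    return p_one if value == 1 else 1.0 - p_one


def joint(a, b, c, d):
    """p(A, B, C, D) as the product of each node given its parents."""
    return (
        p_a[a]
        * bern(p_b_given_a[a], b)
        * bern(p_c_given_a[a], c)
        * bern(p_d_given_bc[(b, c)], d)
    )


states = list(itertools.product((0, 1), repeat=4))
total = sum(joint(*s) for s in states)
# Marginalize over A, B, and C to obtain p(D = 1).
p_d1 = sum(joint(a, b, c, 1) for a, b, c in itertools.product((0, 1), repeat=3))
print(total, p_d1)
```

The joint distribution is specified by nine numbers rather than the fifteen needed for an arbitrary distribution over four binary variables, and marginals follow by summing the product over the unwanted variables.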
Feedback loops in a graphical representation of dependencies between different 
quantities correspond to networks that are not Bayesian networks, according to 
the definition above. The general Bayesian logic expounded in previous sections is 
still applicable, of course, but the analytical simplifications that go along with the 
factoring of the probability distribution do not hold. A better way to understand the 
consequences of feedback loops, in any event, is to think of the probabilities in a time 
dependent context, which amounts to taking a particular graphical representation 
and unfolding the arrows in a new time direction. As an example, the graph of 
probabilistic dependence 


    A → B → C → A        (4.49)


unfolds to


    B_{t_1}    C_{t_1}    A_{t_1}
       ↓          ↓          ↓
    C_{t_2}    A_{t_2}    B_{t_2}        (4.50)
       ↓          ↓          ↓
      ...        ...        ...


Thus variables gain an additional time label and the arrows point from variables at 
one time-slice to the next. The resulting graph is certainly acyclic, and is therefore 
a Bayesian network. Such unfolded networks are referred to as dynamic Bayesian 
networks. 
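The unfolding step can be sketched in a few lines. The code below assumes a three-node feedback loop A → B → C → A as the starting graph; the function names `has_directed_cycle` and `unfold` are hypothetical helpers introduced only for this illustration.

```python
def has_directed_cycle(edges):
    """Return True if the directed graph, given as (tail, head) pairs, has a cycle."""
    children = {}
    nodes = set()
    for tail, head in edges:
        children.setdefault(tail, []).append(head)
        nodes.update((tail, head))
    state = {}  # node -> "visiting" (on the current path) or "done"

    def visit(node):
        if state.get(node) == "visiting":
            return True  # we came back to a node on the current path
        if state.get(node) == "done":
            return False
        state[node] = "visiting"
        if any(visit(child) for child in children.get(node, [])):
            return True
        state[node] = "done"
        return False

    return any(visit(node) for node in sorted(nodes))


def unfold(edges, n_slices):
    """Replace each edge X -> Y by (X, t) -> (Y, t + 1) for consecutive time slices."""
    return [((tail, t), (head, t + 1))
            for t in range(n_slices - 1)
            for tail, head in edges]


feedback_loop = [("A", "B"), ("B", "C"), ("C", "A")]
unfolded = unfold(feedback_loop, n_slices=3)
```

Every arrow in the unfolded graph points from slice t to slice t + 1, so no path can return to its starting node, whatever the structure of the original graph.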

Lee et al. (2004) considered several different sources for deriving gene-gene
interaction information: mRNA coexpression across microarrays, gene fusions,
phylogenetic profiles, co-citations, and protein interaction experiments. They calibrated
the reliability of each source against KEGG pathway database annotations by
computing the ratio of the frequency with which genes linked by the source
operated in the same KEGG pathway to the frequency with which they operated
in different KEGG pathways. They normalized this ratio by picking random pairs
of genes and computing the ratio of the frequency with which the pair operated in
the same KEGG pathway to the frequency with which it operated in different KEGG
pathways. They used the logarithm of this normalized ratio as a score for the
accuracy of the source. In the context of this chapter, their likelihood for the accuracy
of any given source of information was determined by their data for that source,
conditioned on the KEGG database. They then used these probabilities to score
gene linkages that were not in the KEGG database, but were predicted by the
source. Since the probabilities for each source were independently obtained, they
could produce a cumulative log likelihood score for each linkage by adding up the
individual scores. Thus the framework of probability theory allowed an integrated
use of all available data to predict the reliability of a given gene-gene interaction
linkage, placed on a common scoring basis.
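The scoring scheme described above can be sketched as follows. All counts are invented for illustration and the source names are placeholders; only the structure of the calculation (a log ratio of odds, summed over sources under the independence assumption) follows the description in the text.

```python
import math


def log_likelihood_score(same_source, diff_source, same_random, diff_random):
    """Log of the source's same/different pathway odds, normalized by random-pair odds."""
    odds_source = same_source / diff_source
    odds_random = same_random / diff_random
    return math.log(odds_source / odds_random)


# Hypothetical counts: (linked pairs in the same pathway, linked pairs in different pathways).
source_counts = {
    "coexpression": (300, 700),
    "co-citation": (450, 150),
}
random_counts = (50, 950)  # hypothetical counts for randomly picked gene pairs

scores = {
    name: log_likelihood_score(same, diff, *random_counts)
    for name, (same, diff) in source_counts.items()
}

# A linkage supported by both sources receives the sum of the individual scores,
# which is only justified if the sources are independent.
combined = sum(scores.values())
```

A source that performs no better than random pairing scores zero, so positive scores indicate evidence for a shared pathway on a scale common to all sources.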

Learning the probability distributions p(X_i | parents of X_i) for a given Bayesian
network is a major part of determining the probability that the network is a
likely description of the data. These probability distributions are specific to each
hypothesized model since the dependencies between the entities in the network
may differ between models. Sachs et al. (2005) applied this procedure to infer the
most likely protein signaling network from multi-parameter flow cytometry data,
emphasizing the importance for network inference of data collected under
different perturbations. The role of a given perturbation is to fix the measured values of
certain variables in the network, and therefore constrain the possible dependencies
in the set of probability distributions.

The requirements for successful reverse engineering of genetic regulatory networks 
using dynamic Bayesian networks are considered in Husmeier (2003), who shows 
the importance both of known biology in the form of possible interactions and of 
time series data in disequilibrium after a perturbation as the system relaxes in
inferring the network structure. This work also shows the promise (and limitations) 
of MCMC methods in inferring local structures of the genetic network. Rather 
than try to infer a single most likely network, the importance of marginalizing over 
models is apparent in these results, since the posterior probability distribution on 
the space of models is diffuse for sparse data sets. 





4.8 Summary 


The logical analysis of biological data has many advantages, indicated in the intro- 
duction. It has only one “disadvantage”: Known biology must be incorporated in the 
analysis from the beginning, and thought must be expended on the translation of 
this knowledge into quantitative models. The consistency, optimality and unique- 
ness properties of logical inference imply that one cannot “do better” in extracting 
knowledge from new data. 

The key steps are: 


(A) Encode known knowledge into prior probabilities for models that are plausible 
explanations for the new data. 


(B) Compute the likelihood of the new data for these models. 

(C) Compute the posterior probabilities for the models using the likelihoods (B) 
and the prior probabilities (A). 

(D) Examine the likely models using these posterior probabilities and ask what 
experiment would differentiate best between these likely models. 


(E) With new data (D), go to step (A) with the posterior probabilities now serving 
as the prior probabilities. 


This general scheme of inference applies without change across the whole range of
problems, from sequence analysis at one end to reverse engineering at the other. A
consistent application of the simple rules of probability theory is all that is needed.
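Steps (A) to (C) and (E) can be sketched for a discrete set of candidate models. The model names, priors, and likelihood values below are invented for illustration; step (D), the choice of the next experiment, is not modeled.

```python
def posterior(prior, likelihood):
    """Bayes' rule over a discrete set of models: posterior ∝ prior × likelihood."""
    unnormalized = {model: prior[model] * likelihood[model] for model in prior}
    evidence = sum(unnormalized.values())
    return {model: weight / evidence for model, weight in unnormalized.items()}


# (A) Hypothetical prior probabilities encoding existing knowledge.
prior = {"model_1": 0.5, "model_2": 0.3, "model_3": 0.2}

# (B) Hypothetical likelihoods p(data | model) for two successive experiments.
likelihood_rounds = [
    {"model_1": 0.10, "model_2": 0.40, "model_3": 0.40},
    {"model_1": 0.20, "model_2": 0.60, "model_3": 0.10},
]

belief = dict(prior)
history = []
for likelihood in likelihood_rounds:
    # (C) Compute the posterior; (E) it becomes the prior for the next round.
    belief = posterior(belief, likelihood)
    history.append(belief)
```

When the experiments are independent given the model, updating round by round gives the same result as a single update with the product of the likelihoods, which is the consistency property referred to above.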


5 Stoichiometric and Constraint-based 
Modeling 


Steffen Klamt and Jörg Stelling 


A major current challenge in systems biology is to clarify the relationship between 
structure, function, and regulation in complex networks that can be reconstructed 
from genomic or biochemical data. However, dynamic mathematical modeling of 
large-scale networks meets difficulties as the necessary mechanistic detail and 
kinetic parameters are rarely available. In contrast, structural (topological) analyses 
require only reaction stoichiometries and reversibilities, which are often well-known. 
This chapter introduces the main concepts of stoichiometric network analysis, a 
special class of structural analysis methods. We emphasize practical applications for 
obtaining a system-wide understanding of metabolic networks, including functional 
and regulatory aspects. In particular, we aim at providing a critical evaluation of the 
different theoretical approaches available regarding their prerequisites, predictive 
power, and inherent differences. This approach should finally enable the audience 
to make critical judgments on the applicability of stoichiometric network analysis 
for their special problems in systems biology. 





5.1 Overview and Applications 


One of the most important challenges in systems biology is to understand the 
functionality of cellular networks that can be reconstructed from genomic and 
biochemical data for a wide variety of organisms. Current theories have different 
strengths and shortcomings in providing an integrated, predictive description of 
complex networks. For dynamic mathematical modeling of large-scale systems, 
often the necessary mechanistic detail and kinetic parameters are not available. 
In contrast, structure-oriented analyses only require the usually well-characterized 
network topology. Graph theory uses the scheme of network connectivities, which 
is a simplified representation of real reaction networks (see chapter 7). Here we 
introduce a class of analysis methods that consider network stoichiometry explicitly
and potentially other constraints such as maximal pathway capacities as well. These 
approaches can be subsumed under the term stoichiometric network analysis (SNA) 
(Heinrich and Schuster, 1996; Simpson et al., 1999). 

Stoichiometric modeling has become a particularly important approach for un- 
derstanding the function of metabolic networks. Hence, we focus on metabolism, 
and we discuss extensions to cellular regulation. One aim of this chapter is to criti- 
cally review virtues and limitations of the approaches with respect to their potential 
applications for realistic biological networks. For this comparison of altogether four 
major approaches to stoichiometric network analysis, we will address the following 
issues: 


■ Network consistency: Blocked reactions and missing network elements can
compromise the validity of reconstructed networks. They should be detectable by 
analytical methods. 


■ Functional pathways and cycles: Pathways should be sets of connected reactions,
but establishing a theoretically sound notion of “meaningful” pathways is
difficult. Pathway analysis may suggest new hypothetical routes between specific in- 
puts and outputs that only emerge in the context of a complex network. Identifying 
“futile cycles” that involve only a net consumption of energy can help to recognize 
potential energy-wasting routes. Cycles without any net energy consumption point 
to thermodynamic inconsistencies. 


■ Network capabilities: The evaluation of, for instance, maximal product yields
in terms of the moles of product generated per mole of substrate has clear rele- 
vance for biotechnological applications. Stoichiometrically derived yields may give 
indicators of the maximal efficiency of engineered organisms. The identification of 
alternative optimal pathways, or of sub-optimal pathways can, however, be of equal 
importance with regard to the feasibility of genetic engineering approaches. 


■ Importance of reactions: A prominent application of network analysis is to
determine the importance of single reactions for the overall systems performance, 
in particular, by studying knockout mutations. Predicting the effects of enzyme 
deficiencies that cause human diseases is of clear medical relevance. Estimates 
of the relative importance of a reaction may differentiate between essential and 
nonessential genes under specific (environmental) conditions. 


■ Correlated reactions: Reactions that always have to operate together are likely
to be coregulated. This applies to many unbranched linear pathways in biosynthesis. 
Reactions that never appear together point to differential regulation, for instance, 
to establish qualitatively different network operation modes depending on the 
environmental conditions. Hence, such groups of reactions help to understand, and 
possibly predict, features of regulatory networks. 


■ Network design: Studying the effects of adding reactions to or deleting
reactions from a given network is closely related to analyzing the importance of
single reactions. In addition, it can unravel how (additional) constraints on reaction
reversibilities influence the set of possible pathways in the network. Assessing
the effect of newly introduced genes with respect to functional capabilities and 
potential, unanticipated side effects in silico could help to identify targets for the 
addition or removal of genes in vivo. 


■ Network flexibility and robustness: Robustness is generally defined as the
(relative) insensitivity of a system to changes in its parameters (Csete and Doyle, 
2002) (see chapter 2). Flexibility means the capacity to switch between different 
functional modes. Here we regard both concepts as equivalent because a metabolic 
system should tolerate changes in its set of enzymes once it provides for alternative 
pathways when a specific reaction is not functional. Hence, one has to investigate 
the set of all possible behaviors of a system. 





5.2 Stoichiometric Networks 


Before we turn to theoretical approaches for stoichiometric network analysis (SNA), 
we first need to introduce some (formal) terms related to the stoichiometry
and structure of biochemical reaction networks. A biochemical reaction is usually 
characterized by the following properties: 


■ Stoichiometry: The stoichiometry specifies the reactants (educts or products)
participating in a reaction as well as the molar ratios in which they are consumed or 
produced. The stoichiometric coefficient of a metabolite, by convention, is positive 
if it is produced when the reaction proceeds in its forward direction, and negative 
otherwise. 


■ Reaction directionality: In principle, all chemical reactions are thermodynamically
reversible. Certain reactions in biochemical networks, however, can be considered
to be practically irreversible because they (nearly) exclusively proceed in
one direction. An example is the irreversible fixation of carbon dioxide by the
most abundant enzyme in nature, namely Rubisco. Knowledge of the reversibility
of reactions, as will be seen in the next sections, allows one to constrain the number
of possible pathways in a network, since pathways that would involve reactions
proceeding in the “wrong” direction can be excluded from the analysis.


■ Catalyzing enzyme: Many biochemical, in particular metabolic, reactions are
characterized by the participation of an enzyme that facilitates or even enables 
a reaction to proceed. The connections between reactions and enzymes do not 
have to be unique, because several enzymes (isoenzymes) may catalyze the same 
reaction, whereas multifunctional enzymes have the ability to catalyze different 
reactions. Specification of the catalyzing enzyme, however, allows one to directly 
relate structural network properties to features of the genome encoding those 
enzymes. 


■ Reaction kinetics: Reaction kinetics describe the dynamics of the reaction
based on the reaction mechanism and the enzyme properties. In many cases, these
characteristics of a reaction are unknown. However, metabolism is characterized
by usually fast reactions and high turnover of substances when compared to 
regulatory events. Then, at least for certain modeling aspects, dynamics may be 
neglected (see below). 


In structural analyses of biochemical networks, only the first three properties are 
considered. A formal description of the structure and stoichiometry of a reaction 
network can be given as follows: 


■ m: number of (internal) species


■ q: number of reactions; if desired, the catalyzing enzyme(s) and the corresponding
gene(s) can be assigned to each reaction.


■ N: m × q stoichiometric matrix; each row corresponds to one species and each
column to one of the reactions; the matrix element n_{ij} represents the stoichiometric
coefficient of species i in reaction j.


■ rev: the set of the reversible reactions


■ irrev: the set of the irreversible reactions (rev ∩ irrev = ∅)


The structure of any reaction network can be captured by this formalism. In 
the following we will focus on metabolic networks because stoichiometric network 
analysis is especially suited for them. Note that in metabolic networks, biomass 
synthesis may be considered as a pseudo reaction whose (cumulative) stoichiometry 
can accordingly be collected in one of the columns of N. 
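As a minimal sketch of this formalism, the code below assembles N, rev, and irrev for a small hypothetical network. The species, reactions, and coefficients are assumptions made up for this illustration and do not correspond to any network in the figures.

```python
# Hypothetical network with three internal species and four reactions.
species = ["A", "B", "C"]

# Each reaction maps internal species to stoichiometric coefficients:
# negative if consumed, positive if produced in the forward direction.
reactions = {
    "R1": {"A": 1},            # uptake: external substrate -> A
    "R2": {"A": -1, "B": 2},   # A -> 2 B
    "R3": {"B": -1, "C": 1},   # B <-> C (assumed reversible)
    "R4": {"C": -1},           # excretion: C -> external product
}
reversible = {"R3"}

reaction_ids = list(reactions)

# N is m x q: one row per species, one column per reaction.
N = [[reactions[r].get(s, 0) for r in reaction_ids] for s in species]

rev = set(reversible)
irrev = set(reaction_ids) - rev
m, q = len(N), len(N[0])
```

The exchange reactions R1 and R4 have columns with entries of a single sign, because their external partners are not rows of N.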
Figure 5.1 Example network EN1: its graphical and formal representation.


Fig. 5.1 shows the map and the corresponding variables of a simple example 
network, called EN1 throughout the chapter. This network comprises 6 (internal)
metabolites and 10 reactions, two of which are reversible. Characteristics of any 
network are its boundaries and its connections to “the rest of the world.” Related 
to this issue is the notion of internal and external species. Internal species are those 
which are explicitly considered in the network model (and, hence, in N). In contrast, 
external species are thought to be sinks or sources (Heinrich and Schuster, 1996), 
which can lie physically outside the system (for example, substrates or products as 
the four external compounds in figure 5.1), but might also be located inside the 
cell. If, for example, metabolite E in figure 5.1 represented a metabolite in great
excess, such as water, we would probably not consider this metabolite as part of
our model and neglect it as is done in many stoichiometric (and also dynamical)
studies. External metabolites are the reason why the stoichiometric coefficients
of a reaction in N may have only positive (R1, R2 in EN1) or only negative (R3,
R4) signs.

The stoichiometric matrix N is fundamental, not only for SNA. First, the 
underlying graph of a given reaction network—needed for graph-theoretical studies 
(see chapter 7)—can easily be derived from the stoichiometric matrix. Furthermore, 
N is essential also for dynamic modeling of metabolic processes. The changes of 
the species concentrations over time can be described by a system of differential 
equations (see chapter 6 for details of the approach) as follows (Heinrich and 
Schuster, 1996): 


    \frac{dc(t)}{dt} = N \cdot r(t)    (5.1)


The m × 1 vector c(t) represents the current metabolite concentrations and the
q × 1 vector r(t) represents a flux distribution in the network, that is, it contains
the q reaction rates. Vector r(t) is given by a (often approximated) function of
the current metabolite concentrations and of many (often unknown or uncertain)
parameters (contained in vector p), that is,


    r(t) = f(c(t), p, t).    (5.2)


Hence, as already mentioned above, the uncertainties in describing a metabolic 
system dynamically lie within the kinetic description of the reaction rates. However, 
the other part of equation 5.1 is given by N, which in most cases is well-known 
and represents an invariant of the system. N is invariant against time, kinetics, 
and concentrations (although, under certain conditions, only subnetworks, that is, 
submatrices of N may be active). N describes the structural relationships between 
the network components which are of eminent importance for the overall function 
and behavior of the network. Therefore, results obtained by stoichiometric network 
analysis do often have direct implications also for the dynamic behavior. Of course, 
since equation 5.2 is practically neglected, only some of the major characteristics 
of a metabolic system can be extracted by SNA. 
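The division of labor between the invariant matrix N and the uncertain rate laws can be illustrated with a small simulation. The sketch assumes a hypothetical two-metabolite chain with a constant influx and two first-order reactions, and uses explicit Euler integration purely for simplicity; the parameter values are arbitrary.

```python
# Hypothetical chain: R1: -> A, R2: A -> B, R3: B -> (rows: A, B; columns: R1..R3).
N = [[1, -1, 0],
     [0, 1, -1]]


def rates(c, p):
    """Assumed kinetics r = f(c, p): constant influx and first-order conversions."""
    a, b = c
    return [p["v_in"], p["k2"] * a, p["k3"] * b]


def simulate(c0, p, dt=0.01, steps=5000):
    """Explicit Euler integration of dc/dt = N r(c, p)."""
    c = list(c0)
    for _ in range(steps):
        r = rates(c, p)
        dcdt = [sum(N[i][j] * r[j] for j in range(len(r))) for i in range(len(c))]
        c = [ci + dt * di for ci, di in zip(c, dcdt)]
    return c


params = {"v_in": 1.0, "k2": 1.0, "k3": 0.5}
c_final = simulate([0.0, 0.0], params)
r_final = rates(c_final, params)
```

Changing the entries of `params` changes the trajectory and the steady-state concentrations, but N stays the same, and the steady-state rates always satisfy N r = 0.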





5.3 Conservation Relations 


Conservation relations (CR) characterize weighted sums of metabolite concentra- 
tions which remain constant in the system. Here, concentrations are denoted by 
brackets. A typical example occurring frequently in studies on metabolic networks 
is [NADH] + [NAD] = S = CONST. When one of these cosubstrates is consumed, 
then the other is produced, keeping the sum of both concentrations constant. For 
NADH and NAD, this is reflected by the fact that the corresponding row of
NAD in the stoichiometric matrix N is exactly the same as for NADH, except that
it is multiplied by −1. This means that the rows are linearly dependent.

It is a general property of any conservation relation y that it represents a 
combination of rows (species) of N that are linearly dependent (Heinrich and 
Schuster, 1996). Linear combinations of rows of N can be represented by N^T y
(N^T = transpose of N), and finding linearly dependent rows means that y must
fulfill:


    N^T y = 0    (5.3)


(0 is the q × 1 zero vector). This means that a CR y must lie in the null-space of
the transpose of N. One also says that y lies in the left null-space of N (Strang,
1980), since equation 5.3 is equivalent to y^T N = 0^T. The dimension of the left
null-space is m − rank(N), that is, conservation relations only exist if rank(N) < m.
Then, m-rank(N) linearly independent CRs can be found, which can be arranged 
in a matrix Y. For terms related to a null-space see also section 5.4, where the 
null-space of N is analyzed. 

Network EN1 (figure 5.1) does not contain any CR since rank(N)=m=6. A simple 
example would be a network that contains the four metabolites A,B,C,D and only 
one reaction: A + B → C + D. In this case,


    N = \begin{pmatrix} -1 \\ -1 \\ 1 \\ 1 \end{pmatrix}    (5.4)


and, hence, three linearly independent CRs exist (because m − rank(N) = 4 − 1 = 3).
They can be found by searching for linearly independent solutions y for


    N^T y = \begin{pmatrix} -1 & -1 & 1 & 1 \end{pmatrix} y = 0    (5.5)


Three selected independent solutions for y are arranged as columns in the matrix
Y:


    Y = \begin{pmatrix} 1 & 1 & 0 \\ -1 & 0 & 1 \\ 0 & 1 & 0 \\ 0 & 0 & 1 \end{pmatrix}    (5.6)


In the order of the columns this means (i) [A] − [B] = S1 = CONST.; (ii) [A]
+ [C] = S2 = CONST.; (iii) [B] + [D] = S3 = CONST. Furthermore, each linear
combination of these CRs is also a CR, e.g. (i) + (ii) = 2 [A] − [B] + [C] = CONST.
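The search for conservation relations can be sketched with exact rational arithmetic. The helper `null_space` below is a hypothetical utility written for this illustration (a plain Gauss-Jordan elimination); it is applied to the transpose of N for the single reaction A + B → C + D, and the three relations named in the text are checked alongside the computed basis.

```python
from fractions import Fraction


def null_space(rows):
    """Basis of {x : rows · x = 0}, computed exactly by Gauss-Jordan elimination."""
    a = [[Fraction(v) for v in row] for row in rows]
    n_rows, n_cols = len(a), len(a[0])
    pivots = []
    r = 0
    for c in range(n_cols):
        pivot = next((i for i in range(r, n_rows) if a[i][c] != 0), None)
        if pivot is None:
            continue
        a[r], a[pivot] = a[pivot], a[r]
        a[r] = [v / a[r][c] for v in a[r]]
        for i in range(n_rows):
            if i != r and a[i][c] != 0:
                factor = a[i][c]
                a[i] = [vi - factor * vr for vi, vr in zip(a[i], a[r])]
        pivots.append(c)
        r += 1
        if r == n_rows:
            break
    free = [c for c in range(n_cols) if c not in pivots]
    basis = []
    for f in free:
        x = [Fraction(0)] * n_cols
        x[f] = Fraction(1)
        for row_index, p in enumerate(pivots):
            x[p] = -a[row_index][f]
        basis.append(x)
    return basis


# Rows: A, B, C, D; single column: A + B -> C + D.
N = [[-1], [-1], [1], [1]]
N_T = [list(col) for col in zip(*N)]

# Conservation relations span the left null-space of N, i.e. the null-space of N^T.
computed_crs = null_space(N_T)

# The three relations named in the text: [A] - [B], [A] + [C], [B] + [D].
text_crs = [[1, -1, 0, 0], [1, 0, 1, 0], [0, 1, 0, 1]]


def is_conserved(y):
    """Check y^T N = 0 for every reaction (column of N)."""
    return all(sum(y[i] * N[i][j] for i in range(len(N))) == 0
               for j in range(len(N[0])))
```

The computed basis need not coincide with the relations chosen in the text; both span the same three-dimensional left null-space.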

Identifying the CRs is a simple task, but brings important benefits. First, CRs 
are helpful for detecting conserved moieties by searching only for those CRs that are 
composed of a positive sum of metabolite concentrations (Heinrich and Schuster,
1996; Cornish-Bowden and Hofmeyr, 2002). Algorithms for this task (similar to
those for computing elementary modes, see section 5.6.3) exist (Heinrich and
Schuster, 1996). Secondly, CRs are a nice example of how stoichiometric relations
affect systems dynamics: CRs restrict the possible dynamic behavior (equation 5.1)
of a given network. If at a given time point, say at the beginning of the simulation
or experiment, the value of a CR is known, then it will remain constant for all time.
In our small example above, this would mean that if [A] − [B] is 6 at any time point
then there will never be a state of the system where the difference between [A] and
[B] differs from 6. For this reason, CRs express redundancies with respect to
the considered states of the systems. It is therefore possible to remove m-rank(N) 
states from the set of (modeled) system variables without losing information. In our 
example above, we might, thus, remove the three metabolites B, C, D and model 
only A explicitly. Using the CRs and the initial concentrations, we can then derive 
the concentrations of B, C, and D from the current concentration of A at any time 
point (Heinrich and Schuster, 1996; Reder, 1986). 
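The reduction can be checked numerically. The sketch below assumes mass-action kinetics for A + B → C + D with an arbitrary rate constant and initial state (chosen so that [A] − [B] = 6), integrates the full four-variable model and the reduced one-variable model with explicit Euler steps, and reconstructs B, C, and D from the conserved sums.

```python
def simulate_full(c0, k=1.0, dt=0.001, steps=2000):
    """Integrate all four concentrations for A + B -> C + D (assumed mass action)."""
    a, b, c, d = c0
    for _ in range(steps):
        rate = k * a * b
        a, b, c, d = a - dt * rate, b - dt * rate, c + dt * rate, d + dt * rate
    return a, b, c, d


def simulate_reduced(c0, k=1.0, dt=0.001, steps=2000):
    """Integrate only [A]; recover [B], [C], [D] from the conservation relations."""
    a0, b0, c_init, d0 = c0
    # Conserved sums, fixed by the initial state.
    s1, s2, s3 = a0 - b0, a0 + c_init, b0 + d0
    a = a0
    for _ in range(steps):
        b = a - s1                 # from [A] - [B] = S1
        a = a - dt * k * a * b
    b = a - s1
    return a, b, s2 - a, s3 - b    # [C] = S2 - [A], [D] = S3 - [B]


initial = (8.0, 2.0, 0.0, 1.0)     # hypothetical initial concentrations
full = simulate_full(initial)
reduced = simulate_reduced(initial)
```

Both models return the same concentrations up to rounding error, which illustrates that the three removed state variables carried no independent information.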





5.4 Balanced Networks: The Quasi Steady State Assumption 
5.4.1 Metabolite Balancing Equation and Null-Space of N 


Metabolism usually involves fast reactions and high turnover of substances when 
compared to regulatory events. Therefore, analysis of metabolic networks is often 
based on the assumption that, on longer time scales, metabolite concentrations and 
reaction rates are constant. Applying this quasi (pseudo) steady state assumption 
to equation 5.1 leads to the fundamental metabolite balancing equation (Heinrich 
and Schuster, 1996) 


    0 = N r.    (5.7)


This homogeneous system of linear equations demands that the production 
(sum of positive fluxes) and consumption (sum of negative fluxes) of a metabolite 
must be equal, similar to Kirchhoff’s first law for electric circuits. As we will 
see in section 5.5, the metabolite balancing equation is the main constraint in 
constraint-based modeling. Note that in oscillating systems, where the metabolite 
concentrations are not constant (Wolf et al., 2000), equation 5.7 is fulfilled at least 
for the averaged reaction rates. 

The trivial solution r = 0 always fulfills equation 5.7. However, this would
represent thermodynamic equilibrium. We are, for obvious reasons, only interested 
in other solutions, and the cell should (must) have degrees of freedom. Indeed, as 
the number of reactions q in real networks mostly is much larger than the number m 
of internal metabolites, an infinite number of flux distributions r usually complies 
with the system of equations (5.7). From linear algebra, it is known that all possible 
solutions are contained in a vector space called the null-space (or kernel) of N (cf.
the left null-space in equation 5.3) (Strang, 1980). The dimension of the null-space
is q − rank(N), which equals the number of linearly independent solutions for
equation 5.7. A set of q − rank(N) linearly independent solutions can easily be
found and is arranged in a kernel matrix K. Then, all flux distributions r fulfilling 
equation 5.7, that is, which lie in the null-space of N, can be constructed by a linear 
combination b of the columns of K: 


    r = K b.    (5.8)


For illustration, figure 5.2 shows the map and formal representation of a very
simple network called EN2. The null-space has dimension q − rank(N) = 4 − 2 = 2.

Figure 5.2 Example network EN2 (irrev = {R1,R2,R3,R4}).


A kernel matrix for this system, which accordingly must have two columns, reads:


    K = \begin{pmatrix} 1 & 0 \\ 0 & -1 \\ 1 & 1 \\ 1 & 0 \end{pmatrix}    (5.9)


A special balanced flux distribution in this network is r = (2,1,1,2)^T, which can be
constructed from K (equation 5.9) by using b = (2,−1)^T in equation 5.8.
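A kernel matrix can be computed mechanically. Because the figure is not reproduced here, the sketch below assumes a stoichiometry for a four-reaction, two-metabolite network that is consistent with the properties stated in the text (R1 produces A, R2 and R3 both convert A to B, R4 consumes B); this assignment is an assumption, as is the helper `null_space`.

```python
from fractions import Fraction


def null_space(rows):
    """Basis of {x : rows · x = 0}, computed exactly by Gauss-Jordan elimination."""
    a = [[Fraction(v) for v in row] for row in rows]
    n_rows, n_cols = len(a), len(a[0])
    pivots = []
    r = 0
    for c in range(n_cols):
        pivot = next((i for i in range(r, n_rows) if a[i][c] != 0), None)
        if pivot is None:
            continue
        a[r], a[pivot] = a[pivot], a[r]
        a[r] = [v / a[r][c] for v in a[r]]
        for i in range(n_rows):
            if i != r and a[i][c] != 0:
                factor = a[i][c]
                a[i] = [vi - factor * vr for vi, vr in zip(a[i], a[r])]
        pivots.append(c)
        r += 1
        if r == n_rows:
            break
    free = [c for c in range(n_cols) if c not in pivots]
    basis = []
    for f in free:
        x = [Fraction(0)] * n_cols
        x[f] = Fraction(1)
        for row_index, p in enumerate(pivots):
            x[p] = -a[row_index][f]
        basis.append(x)
    return basis


# Assumed stoichiometry (rows: A, B; columns: R1..R4).
N = [[1, -1, -1, 0],
     [0, 1, 1, -1]]

K = null_space(N)  # each basis vector is one column of a kernel matrix

# Any balanced flux distribution is a linear combination r = K b.
b = [1, 2]
r = [sum(bk * vec[j] for bk, vec in zip(b, K)) for j in range(len(N[0]))]
```

The basis returned here is only one of many valid kernel matrices, so the coefficients b that reproduce a given flux distribution depend on the basis chosen.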

Note that the kernel matrix is, in general, not unique. For example, we may 
substitute one of the columns of K in equation 5.9 by the vector r given above. 
Therefore, usually not all qualitatively different flux distributions in the network
are captured. An even more problematic point for analyzing the null-space by the 
kernel matrix is that neither sign (reversibility) nor other capacity restrictions of the 
reactions are considered. For example, as all reactions in EN2 are irreversible, the 
second column of K is not a valid flux distribution in this network because for R2 
a negative sign occurs. It can even happen that a null-space has many dimensions 
(that is, many columns in K), although no other steady state flux than the trivial 
one is feasible in the network. Hence, the “real” degrees of freedom (possibilities for 
distributing metabolic fluxes) can only roughly be estimated from the dimension of 
K. These shortcomings are overcome by constraint-based approaches (section 5.5). 
Some important steady state properties of the system can, nevertheless, be derived 
from K. 
5.4.2 Analysis of the Kernel Matrix


It may happen that a reaction can only have a zero rate in steady state. This is the
case, for example, whenever an internal “dead-end” metabolite participates
only in this reaction. If this reaction carried a non-zero flux, then this metabolite
could not be in steady state. Because of equation 5.8, many blocked reactions (BRs, 
also called “strictly detailed balanced reactions” (Heinrich and Schuster, 1996)) can 
easily be identified from (any) K if their corresponding row in K is a zero row. 
Checking a network for BRs is especially useful in reconstructed networks, since 
BRs can hardly perform any function and therefore often indicate missing network 
elements. For any further network analysis involving the steady state assumption 
(sections 5.5-5.6), they can for practical reasons be removed. 

An enzyme subset (ES), or coupled/correlated reaction set, is a set of reactions 
that must always operate together with a fixed ratio in their rates (Pfeiffer et al., 
2001). Typical examples are reactions in a linear pathway, such as {R4,R7,R10} 
in EN1 (figure 5.1). The rates of these reactions will be equal in any steady state
flux distribution. Consequently, if one reaction is removed from the network (for 
example, by a gene deletion), then the others cannot work properly and their flux 
will be zero in steady state. Since the reactions of an ES are structurally so strongly 
coupled, they are often commonly regulated (Schuster et al., 2002c). ESs are not 
restricted to linear pathways as shown by EN2 (figure 5.2), where {R1,R4} is the 
only ES. ESs can be verified by the null space matrix because the corresponding 
rows in K of two reactions of the same ES can only differ by a (scalar) factor. 
In equation 5.9, the factor for the corresponding rows for R1 and R4 in K (first
and fourth row, respectively) is even unity, which means that the reactions operate 
always with the same stationary rate. 
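Both checks on K are easy to automate. The sketch below uses a hypothetical kernel matrix with five reactions, in which R5 is blocked and R1 and R4 are coupled; the function names are illustrative, and proportionality of two rows is tested by requiring all 2 × 2 minors to vanish.

```python
from itertools import combinations


def blocked_reactions(K, names):
    """Reactions whose row in K is a zero row carry no flux in any steady state."""
    return [name for name, row in zip(names, K) if all(v == 0 for v in row)]


def proportional(u, v):
    """True if the nonzero rows u and v differ only by a scalar factor."""
    return all(u[i] * v[j] == u[j] * v[i]
               for i, j in combinations(range(len(u)), 2))


def enzyme_subsets(K, names):
    """Group non-blocked reactions whose rows in K are proportional to each other."""
    active = [(name, row) for name, row in zip(names, K) if any(v != 0 for v in row)]
    groups = []
    for name, row in active:
        for group in groups:
            if proportional(row, group[0][1]):
                group.append((name, row))
                break
        else:
            groups.append([(name, row)])
    return [[name for name, _ in group] for group in groups if len(group) > 1]


# Hypothetical kernel matrix: one row per reaction, one column per basis vector.
names = ["R1", "R2", "R3", "R4", "R5"]
K = [[1, 0],
     [0, -1],
     [1, 1],
     [1, 0],
     [0, 0]]

blocked = blocked_reactions(K, names)
subsets = enzyme_subsets(K, names)
```

Since both properties hold for every basis of the null-space, the result does not depend on which kernel matrix is supplied.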

Other important conclusions can be drawn if K is block-diagonalizable. Then,
certain sub-networks can be identified in the system that are either completely 
disconnected or whose steady state fluxes are independent from the fluxes in the 
rest of the network (Heinrich and Schuster, 1996). 


5.4.3 Metabolic Flux Analysis 


By applying metabolic flux analysis (MFA), one tries to shrink the possible solution 
space of equation 5.7 by measuring some of the reaction rates (such as uptake or 
excretion rates) in a certain steady state experiment (Stephanopoulos et al., 1998). 
Ideally, one unique solution (a point in the null space of N) remains for the actual 
flux distribution in the respective experiment. The procedure is straightforward: one 
divides equation 5.7 into the measured (index m) and unknown part (u), possibly 
after rearranging the columns in N and components in r: 


    0 = N r = N_u r_u + N_m r_m  \Rightarrow  N_u r_u = -N_m r_m.    (5.10)

The right part of equation 5.10 is the central equation for MFA and characterizes
a flux scenario. The ideal case with only one unique and exact solution occurs
if N_u is a square matrix and invertible because then all unknown rates in r_u
can be determined. However, in general, from the rank of N_u, a scenario can
be classified with respect to determinacy (determined or underdetermined) and 
redundancy (redundant or non-redundant). If a scenario is underdetermined, then 
only some or even none of the unknown rates can be determined. In redundant 
systems, a consistency check can be performed, which is useful for detecting gross 
measurement or modeling errors. The basic techniques for MFA are extensively 
described in Stephanopoulos et al. (1998); van der Heijden et al. (1994); and Klamt 
et al. (2002). In larger networks, despite a number of measurements, many or all 
rates in the system often remain completely unobservable. Then, only isotopic tracer 
experiments may deliver further constraints (Wiechert, 2001). 
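For the ideal case of a square, invertible N_u, the calculation can be sketched as follows. The network is a hypothetical four-reaction example (R1 produces A, R2 and R3 convert A to B, R4 consumes B), the measured values are invented, and the helper names are illustrative; underdetermined or redundant scenarios would need a rank analysis and a least-squares treatment that this sketch does not attempt.

```python
from fractions import Fraction


def solve_square(A, b):
    """Solve A x = b exactly by Gauss-Jordan elimination; A must be invertible."""
    n = len(A)
    aug = [[Fraction(v) for v in row] + [Fraction(rhs)] for row, rhs in zip(A, b)]
    for c in range(n):
        pivot = next((i for i in range(c, n) if aug[i][c] != 0), None)
        if pivot is None:
            raise ValueError("N_u is singular: the unknown rates are not all determined")
        aug[c], aug[pivot] = aug[pivot], aug[c]
        aug[c] = [v / aug[c][c] for v in aug[c]]
        for i in range(n):
            if i != c and aug[i][c] != 0:
                factor = aug[i][c]
                aug[i] = [vi - factor * vc for vi, vc in zip(aug[i], aug[c])]
    return [row[-1] for row in aug]


def flux_analysis(N, measured):
    """Solve N_u r_u = -N_m r_m for the unmeasured rates (square N_u assumed)."""
    q = len(N[0])
    unknown = [j for j in range(q) if j not in measured]
    N_u = [[row[j] for j in unknown] for row in N]
    rhs = [-sum(row[j] * value for j, value in measured.items()) for row in N]
    return unknown, solve_square(N_u, rhs)


# Hypothetical network (rows: A, B; columns: R1..R4).
N = [[1, -1, -1, 0],
     [0, 1, 1, -1]]

# Invented measurements: R1 = 2 and R2 = 1 (column indices 0 and 1).
unknown, r_u = flux_analysis(N, {0: 2, 1: 1})
```

Measuring R1 and R4 instead would leave R2 and R3 undetermined, because only their sum is fixed by the balances; in that case `solve_square` raises a `ValueError`.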

To give a small example, we assume that in EN1 we measured the rates R1=R3=2
and R4=1 (figure 5.3). We could then calculate R2=R7=R9=R10=1. The other
three rates remain unknown.
Figure 5.3 Example for metabolic flux analysis: stationary rates of R1, R3, and R4 are
measured (bold arrows). Using this information, one can determine the fluxes of R2, R7, 
R9, and R10 (dashed arrows). The other rates remain unknown (thin arrows). 


In general, MFA is useful for analyzing specific flux distributions, but it is not 
able to characterize the complete admissible steady state solution space. 





5.5 Constraint-Based Modeling 


5.5.1 Principles of Constraint-Based Modeling 


In the previous section we introduced the metabolite balancing equation which 
resulted from the assumption of quasi steady state. As a consequence of this con- 
straint, the space of possible flux distributions in a reaction network reduces from 
“everything is possible” to the null-space of N. The basic idea of the constraint-based 
approach, mainly developed by B.O. Palsson and colleagues, is to incorporate further
well-defined physicochemical and biological constraints that limit the overall
network behavior with respect to the possible flux patterns (Varma and Palsson, 1993;
Reed and Palsson, 2003; Price et al., 2003, 2004). As a result, the solution space,
encompassing all flux distributions satisfying the imposed constraints, shrinks. Different
types of constraints can be involved and they can all be expressed by linear
equations or inequalities:


C1) Quasi steady state: $0 = N r$

C2) Capacity/Reversibility: $\alpha_i \leq r_i \leq \beta_i$

For all irreversible reactions one usually sets $\alpha_i = 0$ in C2. Flux capacity
constraints are often known for exchange (uptake/excretion) reactions. If capacity
constraints, which for internal reactions are normally given by the $V_{max}$ value of
the enzyme, are unknown, then the boundary values of the reaction rates are set to
$\pm\infty$. C2 can be simplified to a pure reversibility constraint when no capacity
values are known or considered:

C2’) Reversibility: $r_i \geq 0$ (for all irreversible reactions $i$)

C3) Measurements: $r_i = m_i$ (for measured/known rates $i$)

C4) Optimality: $s^T r = s_1 r_1 + s_2 r_2 + \ldots + s_q r_q = \text{Max!}$

Note that null-space and metabolic flux analysis can be seen as special
constraint-based methods that take into account the constraints C1 (+ partially
C2) and C1 + C3, respectively.
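The constraint types C1-C4 can be made concrete with a small sketch. The four-reaction network, its bounds, and the helper name `check_constraints` are hypothetical and are not taken from the text or from EN1; exact rational arithmetic is used so that the steady-state test is not disturbed by rounding.

```python
from fractions import Fraction

# Hypothetical toy network (not EN1) with internal metabolites A and B:
#   R1: -> A  (uptake)     R2: A -> B
#   R3: B ->  (product)    R4: A ->  (by-product)
N = [
    [1, -1, 0, -1],   # balance of A
    [0, 1, -1, 0],    # balance of B
]
INF = float("inf")
alpha = [0, 0, 0, 0]          # lower bounds; 0 marks an irreversible reaction
beta = [1, INF, INF, INF]     # upper bounds; uptake R1 is capped at 1
s = [0, 0, 1, 0]              # objective weights: only the product flux R3 counts


def check_constraints(r, measured=None):
    """Report whether C1-C3 hold for flux vector r and evaluate the C4 objective."""
    measured = measured or {}
    c1 = all(sum(n * x for n, x in zip(row, r)) == 0 for row in N)
    c2 = all(lo <= x <= hi for lo, x, hi in zip(alpha, r, beta))
    c3 = all(r[i] == m for i, m in measured.items())
    objective = sum(w * x for w, x in zip(s, r))
    return {"C1": c1, "C2": c2, "C3": c3, "objective": objective}


balanced = [Fraction(1), Fraction(3, 4), Fraction(3, 4), Fraction(1, 4)]
unbalanced = [Fraction(1), Fraction(1), Fraction(1), Fraction(1)]
print(check_constraints(balanced, measured={3: Fraction(1, 4)}))
print(check_constraints(unbalanced))
```

The second vector consumes more A than is taken up, so it fails C1 although it respects all bounds.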


Figure 5.4 Example of a convex polyhedral cone (axes: Rate 1, Rate 2).


Constraints C1 and C2’ are in practice often well known for a given network. The
set F of all flux vectors r obeying these constraints

$F = \{ r \in \mathbb{R}^q : 0 = N r,\ r_i \geq 0\ \forall i \in \mathrm{irrev} \}$ (5.11)

represents, mathematically, a convex polyhedral cone (Rockafellar, 1970; Bertsimas
and Tsitsiklis, 1997). In stoichiometric studies, it is often referred to as the flux
cone. According to C1 and C2’, this cone is the intersection of the null-space with
the positive halfspaces of the irreversible reactions. An example of a
three-dimensional polyhedral cone is given in figure 5.4. As suggested by this
picture, the edges of such a cone are of eminent importance; they are the subject of
pathway analysis (section 5.6).

The constraints C2-C4 further restrict the cone to a smaller subset of flux vectors,
which then, in general, forms a polyhedron. Note that the optimality condition C4
is not always regarded as a constraint. However, one may treat it as one because,
like the other constraints, it reduces the space of flux vectors of interest. The
optimality condition C4 is central to the approach of flux balance analysis, which
is introduced next.


5.5.2 Flux Balance Analysis 


Flux balance analysis (FBA) seeks to identify extreme patterns of flux distribu- 
tions that keep the network balanced (constraint C1), are thermodynamically fea- 
sible (C2) and maximize a linear objective function (C4). Thus, the characteristic 
and necessary assumption of FBA is the optimal function of the network expressed 
by the optimality constraint C4. The three constraints C1, C2, and C4, in math- 
ematical terms, represent a linear optimization problem (Kauffman et al., 2003a; 
Bertsimas and Tsitsiklis, 1997), which may be optionally extended by measurements 
(C3). In most cases, the (linear) objective function is the maximization of growth 
or product yield. The vector s in the linear objective function used in C4 represents 
the optimization criteria and weights the reaction rates. For maximizing the growth 
rate, for example, only the coefficient corresponding to the growth rate is set to one 
and all others to zero. As an example, assume we want to maximize the yield of P 
(reaction R3) for growth on substrate A in network EN1 (figure 5.1). The variables 
for the constraints then read: 


■ Stoichiometry (for C1): N as given in figure 5.1

■ Boundaries (for C2): $\alpha = (0, -\infty, 0, 0, 0, 0, 0, -\infty, 0, 0)$;
$\beta = (1, 0, \infty, \infty, \infty, \infty, \infty, \infty, \infty, \infty)$

■ Linear objective function (for C4): $s = (0, 0, 1, 0, 0, 0, 0, 0, 0, 0)$


Note that only $\alpha_2$ and $\alpha_8$ are $-\infty$ because only R2 and R8 are reversible.
Furthermore, we set $\beta_2 = 0$ (B cannot be taken up) because exclusive growth on
substrate A is considered. Only $s_3$ is non-zero as we want to optimize R3. Finally,
we assume that the maximal uptake rate of A is 1 ($\beta_1 = 1$). Using available
computer routines such as the simplex algorithm (Bertsimas and Tsitsiklis, 1997),
one can easily solve such a linear optimization problem. In our example, one might
get a solution as shown in figure 5.5 with an optimal yield of P/A = 1.
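As a minimal sketch of such an FBA problem, the following code solves the linear program for a hypothetical four-reaction network (not EN1, whose stoichiometric matrix is given in figure 5.1). Instead of the simplex algorithm named above, it uses a brute-force enumeration of the vertices of the feasible polyhedron. This is an assumption made to keep the example free of external solvers: it is only workable for toy networks and presumes that the optimum is bounded. The function names are illustrative.

```python
from fractions import Fraction
from itertools import product

INF = float("inf")


def solve_unique(A, b):
    """Solve A x = b exactly; return x if the solution is unique, else None."""
    n_rows = len(A)
    n_cols = len(A[0]) if A else 0
    M = [[Fraction(v) for v in row] + [Fraction(rhs)] for row, rhs in zip(A, b)]
    rank = 0
    for col in range(n_cols):
        pivot = next((i for i in range(rank, n_rows) if M[i][col] != 0), None)
        if pivot is None:
            return None                      # free variable: not a vertex
        M[rank], M[pivot] = M[pivot], M[rank]
        M[rank] = [v / M[rank][col] for v in M[rank]]
        for i in range(n_rows):
            if i != rank and M[i][col] != 0:
                factor = M[i][col]
                M[i] = [vi - factor * vr for vi, vr in zip(M[i], M[rank])]
        rank += 1
    if any(M[i][n_cols] != 0 for i in range(rank, n_rows)):
        return None                          # inconsistent system
    return [M[i][n_cols] for i in range(n_cols)]


def fba_bruteforce(N, alpha, beta, s):
    """Maximize s.r subject to N r = 0 (C1) and alpha <= r <= beta (C2)."""
    q = len(s)
    best = None
    for choice in product(("lower", "upper", "free"), repeat=q):
        fixed = {i: (alpha[i] if c == "lower" else beta[i])
                 for i, c in enumerate(choice) if c != "free"}
        if any(abs(v) == INF for v in fixed.values()):
            continue                         # an infinite bound cannot be active
        free = [i for i in range(q) if i not in fixed]
        A = [[row[i] for i in free] for row in N]
        b = [-sum(row[i] * v for i, v in fixed.items()) for row in N]
        x = solve_unique(A, b)
        if x is None:
            continue
        r = [Fraction(fixed[i]) if i in fixed else x[free.index(i)]
             for i in range(q)]
        if all(lo <= ri <= hi for lo, ri, hi in zip(alpha, r, beta)):
            value = sum(w * ri for w, ri in zip(s, r))
            if best is None or value > best[0]:
                best = (value, r)
    return best


# Hypothetical network: R1: -> A, R2: A -> B, R3: B -> (product), R4: A -> (by-product)
N = [[1, -1, 0, -1], [0, 1, -1, 0]]
alpha = [0, 0, 0, 0]            # all reactions irreversible
beta = [1, INF, INF, INF]       # uptake R1 limited to 1
s = [0, 0, 1, 0]                # maximize the product flux R3
value, fluxes = fba_bruteforce(N, alpha, beta, s)
print(value, [str(x) for x in fluxes])
```

For realistic network sizes one would hand the same matrices to a linear programming solver; the enumeration above merely makes the geometry of the problem explicit.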

Figure 5.5 Optimal flux distribution for producing P from A in EN1.


The following main applications of FBA have become attractive for metabolic
engineering, but also for systems biology:


■ Predicting optimal yield and optimal behavior: FBA enables one to predict
production capabilities of a micro-organism (Varma and Palsson, 1993). This is of
high interest for industrial applications (Stephanopoulos et al., 1998; Nielsen, 1998),
and FBA can also be used to search for optimal knock-outs (Burgard et al., 2003)
with respect to certain criteria. Furthermore, bacteria such as E. coli have been
shown to behave (stoichiometrically) optimally with respect to biomass yield, at
least under selective pressure (Edwards et al., 2001a; Ibarra et al., 2002). Thus, at
least for certain conditions, this optimal behavior can be quantified in silico.

■ Predicting functionality and phenotypes (after gene deletions): A very
useful application of FBA is to investigate whether a certain function can be
performed at all in a network, especially after removal of network elements
(simulating gene deletions): if a reaction is removed from the network (that is,
from N), one may optimize the network again. If the optimal value, for example,
of the growth rate, now becomes zero, then this function (growth) is definitely no
longer possible. This procedure has been applied, for example, in (Edwards and
Palsson, 2000; Förster et al., 2003), and it could be shown that the prediction
“growth is/is not possible” agrees well with the real phenotype. In particular, when
a function is possible in reality although the in silico analysis of the network
predicts the opposite (false negative prediction), there must be an error or
something missing in the considered network.


■ Flux coupling: FBA can be used to analyze flux couplings in a network (Burgard
et al., 2004). Similar to investigating the null-space matrix, blocked and fully
coupled reactions may be identified, but reversibility constraints are explicitly
considered. Additionally, weaker couplings may also be identified, for instance,
where one reaction is used whenever another reaction is active, but not
automatically vice versa (as for R1 and R5 in EN1, for example). The results of such
investigations can help to infer the underlying regulatory rules.


The usefulness of FBA has been proven in many applications, in particular for
microbial model organisms (Price et al., 2004), but there are also limitations one
should be aware of. FBA critically depends on the optimality criterion applied. Not
all cells, and bacterial cells not under all circumstances, will behave
stoichiometrically optimally. This means that, in general, network capabilities but
not the real phenotype can be predicted. Moreover, the optimal value of the objective
function is unique and an optimal solution will usually be found. However, especially
in large (genome-scale) networks, the calculated optimal flux distribution itself may
not be unique. Consider our optimal solution in figure 5.5. It is easy to find another
optimal flux distribution that also realizes optimal yield (P/A = 1), such as the left
one in figure 5.6. We can even (linearly) combine this solution with the one in
figure 5.5 (here with a factor of 0.5 for both), yielding the right flux map in
figure 5.6. Thus, infinitely many optimal flux distributions exist even in this small
network. Therefore, in most cases, although the additional constraints C2 and C4 of
FBA shrink the solution space considerably, infinitely many solutions can remain. FBA
delivers one particular optimal solution. Thus, even if optimality is assumed, it may
happen that only little can be said about the internal behavior, that is, how the
fluxes are distributed inside the cell (Mahadevan and Schilling, 2003).


Figure 5.6 Two further optimal flux distributions for producing P from A in EN1.


However, often one can specify some reaction rates that are fixed in any optimal 
solution. For the example optimization problem in EN1, we could derive that R4, 
R7, and R10 must be zero during optimal behavior (because they are involved in 
side production of E) and that R1, R3, and R9 carry a fixed flux of unity. Thus, 
only R5, R6, and R8 remain variable. Fixed rates in optimal flux patterns can easily 
be identified (Mahadevan and Schilling, 2003). Moreover, one may also determine 
the qualitatively distinct optimal solutions (as the two in figs. 5.5 and 5.6 (left)) 
for a given FBA problem, for example, by mixed-integer linear programming (Lee 
et al., 2000), or—in smaller networks—by elementary modes as described in a later 
section. 
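The non-uniqueness of optimal flux distributions discussed in this section can be reproduced on a very small scale. The sketch below uses a hypothetical network with two parallel routes from A to B (it is not EN1, and the helper names are illustrative) and shows that every convex combination of two optimal flux vectors is again feasible and optimal.

```python
from fractions import Fraction

# Hypothetical network with two parallel routes from A to B (not EN1):
#   R1: -> A (uptake, at most 1)   R2: A -> B   R3: A -> B   R4: B -> (product)
N = [
    [1, -1, -1, 0],   # balance of A
    [0, 1, 1, -1],    # balance of B
]
MAX_UPTAKE = 1


def is_feasible(r):
    """C1 (steady state) and C2 (irreversibility, uptake capacity)."""
    balanced = all(sum(n * x for n, x in zip(row, r)) == 0 for row in N)
    return balanced and all(x >= 0 for x in r) and r[0] <= MAX_UPTAKE


def product_yield(r):
    """Product formed (R4) per substrate taken up (R1)."""
    return r[3] / r[0]


def mix(u, v, weight):
    """Convex combination weight * u + (1 - weight) * v."""
    return [weight * a + (1 - weight) * b for a, b in zip(u, v)]


via_R2 = [Fraction(x) for x in (1, 1, 0, 1)]   # one optimal solution
via_R3 = [Fraction(x) for x in (1, 0, 1, 1)]   # a second optimal solution
half = mix(via_R2, via_R3, Fraction(1, 2))     # their 0.5/0.5 combination

for r in (via_R2, via_R3, half):
    print([str(x) for x in r], is_feasible(r), product_yield(r))
```

Because the weight can be any number between 0 and 1, the two qualitatively distinct optima already generate infinitely many optimal flux distributions.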


5.5.3 Minimization of Metabolic Adjustment (MoMA) 


The analysis of the stoichiometric implications of gene deletions is one important 
application of FBA, because FBA can find a new (optimal) flux distribution. Even if 
the wild type grows optimally, mutants may not necessarily behave optimally with 
respect to their retained resources. Instead they could adjust their metabolism 
with minimal effort (Segre et al., 2002). This assumption suggests that the cell
searches for the nearest solution in the new feasible space of steady state flux
distributions, which is part of the wild type solution space. Formally, this leads 
again to a constraint-based problem, where w represents the optimal solution of 
the wild type and d the index of the deleted reaction whose rate is set to zero: 


$N r = 0$ (5.12)
$\alpha_i \leq r_i \leq \beta_i$
$r_d = 0$
$(r - w)^T (r - w) = \text{Min!}$


The first three terms correspond to C1-C3 in the usual FBA, whereas the fourth 
term leads to a quadratic programming problem whose handling, however, is 
mathematically straightforward (Segre et al., 2002). 

For E. coli mutants, this approach led to better predictions than FBA (Segre
et al., 2002). However, MoMA at first needs the flux distribution from the wild type, 
which is also assumed to be optimal and, hence, determined by FBA. Therefore, 
MoMA also faces the problem of non-unique optimal flux distributions in the wild 
type. It can, thus, also result in non-unique solutions for the mutant (Mahadevan 
and Schilling, 2003). Hence, for MoMA it is essential to identify the real flux 
distribution in the wild type under a given environment. 





5.6 Pathway Analysis 


Pathway analysis deals with the discovery and analysis of meaningful routes in 
(primarily) metabolic networks using the concepts of extreme pathways (EPs) and 
of elementary flux modes (EFMs) (Papin et al., 2003). In contrast to FBA or MFA, 
it characterizes the complete space of admissible steady-state flux distributions by 
particular flux vectors. 


5.6.1 Principles of Pathway Analysis 


Extreme pathways/elementary flux modes are structural elements that are unique
for a given network and can be considered as the smallest functional entities
(Schuster and Hilgetag, 1994; Schilling et al., 2000). They both are defined by
a flux vector e composed of q elements $(e_1, e_2, \ldots, e_q)$, each describing the net
rate of the corresponding reaction. The pathway represented by e can be identified
by the utilized reactions. We denote this by

$P(e) = \{ i : e_i \neq 0 \}$. (5.13)

In other words, the pathway representation P(e) specifies all reactions that
participate in the EP or EFM e. If e is an EFM or EP, it fulfills the following
three conditions (Schuster et al., 1999, 2000; Schuster and Hilgetag, 1994; Schilling
et al., 2000; Schuster et al., 2002b):


(C1) Pseudo steady state: None of the metabolites is consumed or produced in
the overall stoichiometry according to equation 5.7. Hence, the EP or EFM e lies in
the null-space of N, and $N e = 0$ holds.


(C2) Feasibility: All fluxes in an EP or EFM have to be thermodynamically
feasible, that is, irreversible reactions have to proceed in the “right” direction.
Formally, this requires that $e_i \geq 0$ for every reaction $i \in \mathrm{irrev}$.


(C3) Non-decomposability: The central property of EPs and EFMs is that they
represent the minimal functional units in a network. No reaction can be deleted
from an EP or EFM such that the remainder still forms a valid (non-trivial) steady
state flux distribution. Formally, there exists no vector v, different from the zero
vector and from e, that fulfills C1 and C2 and for which P(v) is a proper subset of
P(e). This feature is also called genetic independence because C3 implies that the
participating enzymes in one pathway are not a subset of the enzymes in another
pathway. C1 and C3 together ensure that the sub-network spanned by the reactions in
pathway e is connected.


Conditions C1-C3 completely define an EFM up to a scaling factor for each pathway.
Note that C1 and C2 are identical to C1 and C2’ used in the constraint-based
approach (section 5.5). For an EP, two additional conditions have to be satisfied (see
section 5.6.2). Importantly, both approaches provide a unique decomposition of a
given network structure into EPs or EFMs, respectively. Hence, they unambiguously
represent a particular network. The small example network EN2 illustrates these
basic properties of EFMs (figure 5.7). Only two EFMs can occur, namely one using
the upper branch of the central reaction couple, and the other one using the lower
branch. The third flux distribution is not an EFM because the irreversible reaction
R2 operates in the backward direction and thereby violates feasibility condition
C2. The rightmost flux distribution violates condition C3; it can be decomposed
into EM1 scaled by a factor of two and EM2.


Figure 5.7 Elementary flux modes in the example network EN2 (left), and two flux
distributions that do not constitute EFMs (right). Bold face denotes participating
reactions and their (normalized) rates.
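The three conditions can be checked mechanically for a given flux vector. The sketch below uses a hypothetical network that merely resembles the verbal description of EN2 (its stoichiometry is an assumption, not a copy of figure 5.7). For C3 it applies a rank criterion that is not described in the text: a vector that already fulfills C1 and C2 is treated as non-decomposable exactly when the columns of N belonging to its active reactions have a one-dimensional null-space.

```python
from fractions import Fraction

# Hypothetical network resembling the description of EN2:
#   R1: -> A   R2: A -> B (upper branch)   R3: A -> B (lower branch)   R4: B ->
N = [
    [1, -1, -1, 0],   # balance of A
    [0, 1, 1, -1],    # balance of B
]
IRREVERSIBLE = {0, 1, 2, 3}   # indices of irreversible reactions (here: all)


def rank(matrix):
    """Rank of a small matrix by exact Gaussian elimination."""
    M = [[Fraction(v) for v in row] for row in matrix]
    r = 0
    for col in range(len(M[0]) if M else 0):
        pivot = next((i for i in range(r, len(M)) if M[i][col] != 0), None)
        if pivot is None:
            continue
        M[r], M[pivot] = M[pivot], M[r]
        for i in range(r + 1, len(M)):
            factor = M[i][col] / M[r][col]
            M[i] = [a - factor * b for a, b in zip(M[i], M[r])]
        r += 1
    return r


def first_violation(e):
    """Return 'C1', 'C2' or 'C3' for the first violated condition, else None."""
    if any(sum(n * x for n, x in zip(row, e)) != 0 for row in N):
        return "C1"
    if any(e[i] < 0 for i in IRREVERSIBLE):
        return "C2"
    support = [i for i, x in enumerate(e) if x != 0]      # P(e) of equation 5.13
    sub = [[row[i] for i in support] for row in N]
    if len(support) - rank(sub) != 1:                      # assumed rank criterion
        return "C3"
    return None


for vector in [(1, 1, 0, 1), (1, 0, 1, 1), (1, -1, 2, 1), (3, 2, 1, 3)]:
    print(vector, first_violation(vector))
```

The last vector equals twice the first plus the second, which is why it fails the non-decomposability test although it is balanced and thermodynamically feasible.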


The last example referred to a particular property of EFMs and EPs, namely
convexity, which is of paramount importance for pathway analysis. The basic
conditions C1-C3 imply that all feasible steady state flux distributions v can be
described by a nonnegative superposition of all EFMs or all EPs, respectively.
With the complete set of EFMs in a network denoted by $S_{EFM}$ and, in analogy, the
complete set of EPs by $S_{EP}$, this feature is formally represented by

$v = \sum_j \alpha_j e^j$. (5.14)

Here, $\alpha_j$ is a nonnegative scaling factor ($\alpha_j \geq 0$), and $e^j$ denotes the j-th
EFM $\in S_{EFM}$ or the j-th EP $\in S_{EP}$, respectively. The pattern of superposition
does not necessarily have to be unique for a given flux distribution, that is,
different combinations of EFMs or EPs may lead to an identical flux pattern. Hence,
in most cases a direct decomposition of a flux distribution into the underlying EFMs
and EPs is not possible.
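A small example makes the ambiguity of the superposition tangible. The network and its four EFMs below are hypothetical (two successive branch points, all reactions irreversible); the function name `superpose` is illustrative. Two different sets of weights produce the same flux distribution.

```python
# Hypothetical network with two successive branch points (not from the text):
#   R1: -> A   R2, R3: A -> B   R4, R5: B -> C   R6: C ->
# Its four EFMs, written as flux vectors (R1, ..., R6):
efms = {
    "via R2,R4": (1, 1, 0, 1, 0, 1),
    "via R2,R5": (1, 1, 0, 0, 1, 1),
    "via R3,R4": (1, 0, 1, 1, 0, 1),
    "via R3,R5": (1, 0, 1, 0, 1, 1),
}


def superpose(weights):
    """Equation 5.14: v = sum_j alpha_j * e^j with nonnegative alpha_j."""
    if any(a < 0 for a in weights.values()):
        raise ValueError("scaling factors must be nonnegative")
    q = len(next(iter(efms.values())))
    return [sum(a * efms[name][i] for name, a in weights.items())
            for i in range(q)]


first = superpose({"via R2,R4": 0.5, "via R3,R5": 0.5})
second = superpose({"via R2,R5": 0.5, "via R3,R4": 0.5})
print(first)
print(second)
print(first == second)
```

Both weightings yield a flux of 0.5 through each branch, so the observed flux pattern alone does not reveal which pathways were combined.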

Importantly, all edges—the so-called extreme rays—of the convex flux cone 
(section 5.5) are contained in the sets of EPs and EFMs, respectively (Schuster 
et al., 2002b), which directly follows from equation 5.14. In convex analysis, EPs 
and EFMs are called generating vectors of the convex cone. The concepts of EPs and 
EFMs were derived from a more general convex analysis approach to stoichiometric 
networks. There, pathways have been called extreme currents, but they were 
restricted to irreversible reactions (Clarke, 1988). EFMs permit all reactions to 
be reversible, while for EPs, this is allowed for certain fluxes (see below). 


5.6.2 Elementary Flux Modes and Extreme Pathways 


The conditions C1-C3 already uniquely determine the complete set of EFMs in a
network (up to a scaling factor for each pathway vector). Two additional conditions
delimit the EPs from the EFMs (Schilling et al., 2000):


(C4$_{EP}$) Network configuration: Reactions have to be classified either as
exchange fluxes, which allow a metabolite to enter or to exit the system, or as
internal reactions. All reversible internal reactions must be described by two
separate irreversible reactions for the forward and the backward direction,
respectively. Exchange fluxes can be reversible, and each metabolite may participate
in at most one exchange flux.


(C5$_{EP}$) Systemic independence: The set of EPs in a network configured according
to condition C4$_{EP}$ is the minimal set of generating vectors that allows all
feasible steady state flux distributions to be described by equation 5.14. The network
configuration (C4$_{EP}$) ensures that the set of EPs is unique for a given network.


Thus, extreme pathways are only defined in a particular representation of a given
network.

Reconfiguration and the particular conditions for EPs lead to the following
consequences, which can be exemplified by EN1 (figure 5.8 and table 5.1): (i)
Each split reversible reaction leads to a “two-cycle” constituted by the forward
and backward branches, for instance, EM9’ in EN1 for R8. This type of pathway,
however, has no practical meaning and is usually not considered further (Papin
et al., 2002). (ii) Except for these two-cycles, the EFMs in the original and the
reconfigured network are equivalent. (iii) In a reconfigured network, the set of
EPs is always a (proper or non-proper) subset of the EFMs because each EP
obeys conditions C1-C3, that is, $S_{EP} \subseteq S_{EFM}$. Each EP can be mapped onto
a corresponding EFM, while the inverse is not true. For instance, EM1’-3’
(table 5.1) can be represented by non-negative linear combinations of EPs and,
hence, are not systemically independent. (iv) Systemically dependent EFMs that
are not EPs occur only when a network contains reversible exchange fluxes (Klamt
and Stelling, 2003; Papin et al., 2004b) such as R2 in EN1. There, the direct pathway
(EM1) can formally be decomposed into two pathways that rely on the reversible
exchange flux of metabolite B (EM5’,8’ = EP2,5).


Figure 5.8 Elementary flux modes EM1-8 in the example network EN1. EFMs were
grouped by the net conversion of external metabolites (bottom of each box) as indicated
by different gray background levels.


As the above examples indicate, the set of EFMs related to a network shows
certain conservation properties. When a reversible reaction is changed to
irreversible, a new pathway set is obtained by excluding those EFMs from the original
set that use the specific reaction in the forbidden direction (Schuster et al., 2002b).
Hence, one can calculate the EFMs separately for forward and backward direction and
then assemble the complete set of EFMs for the original network by uniting the
two sub-sets. Likewise, if a reaction is deleted, the subset of EFMs not involving
this reaction is the complete set of EFMs in the reduced network (Schuster et al.,
2000). In contrast, the set of EPs needs to be recalculated whenever a (partial)
reaction is removed.
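These conservation properties amount to simple filters on a stored EFM set. The sketch below applies them to the four EFMs of a hypothetical five-reaction network with one reversible reaction (not EN1); the function names are illustrative.

```python
# EFMs of a hypothetical five-reaction network (not EN1), as {reaction: rate}:
#   R1: -> A   R2: A <-> B (reversible)   R3: A -> P   R4: B -> Q   R5: -> B
efms = {
    "EM1": {"R1": 1, "R3": 1},
    "EM2": {"R1": 1, "R2": 1, "R4": 1},
    "EM3": {"R5": 1, "R4": 1},
    "EM4": {"R5": 1, "R2": -1, "R3": 1},
}


def delete_reaction(modes, reaction):
    """EFMs of the reduced network: those that do not involve `reaction`."""
    return {name: e for name, e in modes.items() if e.get(reaction, 0) == 0}


def restrict_direction(modes, reaction, sign):
    """EFMs that remain when `reaction` may only run in direction `sign` (+1 or -1)."""
    return {name: e for name, e in modes.items() if sign * e.get(reaction, 0) >= 0}


forward_only = restrict_direction(efms, "R2", +1)
backward_only = restrict_direction(efms, "R2", -1)

print(sorted(delete_reaction(efms, "R2")))
print(sorted(forward_only))
print(sorted(backward_only))
# Uniting the two direction-specific sets recovers the original set.
print(sorted(forward_only.keys() | backward_only.keys()) == sorted(efms))
```

No recomputation of pathways is involved; this is the practical advantage of EFMs over EPs that the paragraph above points out.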
Table 5.1 Relations between elementary flux modes in the original EN1 (EFMs; cf.
figure 5.8), in EN1 after reconfiguration (EFMs’), and extreme pathways (EPs).


EFMs   EFMs’   EPs   Sum of EPs
EM1    EM1’    -     EP2 + EP5
EM2    EM2’    -     EP3 + EP5
EM3    EM3’    -     EP2 + EP4
EM4    EM4’    EP1   -
EM5    EM5’    EP2   -
EM6    EM6’    EP3   -
EM7    EM7’    EP4   -
EM8    EM8’    EP5   -
-      EM9’    EP6   -








5.6.3 Calculation of Pathway Sets 


Several algorithms have been proposed for the enumeration of pathways (Schuster
et al., 2000; Wagner, 2004). They contain a common core (Gagneur and Klamt,
2004) shown as pseudo-code in figure 5.9. Pathway sets, stored in matrices $M^t$, are
built iteratively by successively processing the imposed equality (C1), inequality
(C2), and elementarity (C3) constraints. An initial matrix $M^0$ can be derived from
N, for example, using a special kernel matrix K of N (Wagner, 2004); in this aspect,
existing algorithms differ most. Until all constraints are satisfied, the rows in $M^t$,
which represent preliminary pathways, have to be processed for compliance with
conditions C1-C2. Thereby, new candidate pathways are generated by Gaussian
combination of pairs of rows in $M^t$. Additionally, computationally expensive tests
have to be performed to comply with C3.


Figure 5.9 Pseudo-code for pathway calculation.


Construct initial matrix M^0 from N
for all constraints of C1/C2 not satisfied
    Process current constraint for all rows in M^t
    M^{t+1} = Pairwise Gaussian combinations of rows of M^t
    Test for elementarity of all candidate pathways (C3)
    M^t = M^{t+1}
end
EFMs = M^{t+1}

These requirements render the combinatorial problem of pathway identification
NP-hard. With increasing network size, the number of pathways and the associated
computational costs are likely to grow more than linearly (Klamt and Stelling,
2002). Therefore, pathway analysis has mainly been applied to networks of small or
moderate size. With algorithmic improvements, however, the networks investigated
have become increasingly complex (Stelling et al., 2002; Förster et al., 2002; Wiback
and Palsson, 2002). We will discuss prospects for EFM/EP computation in
genome-scale networks in section 5.7.


5.6.4 Applications of Pathway Analysis 


Pathway analysis, which comprises the approaches of elementary flux modes and 
of extreme pathways, per se aims at the dissection of complex networks into 
smaller functional units. Here, we consider how these entities help in understanding 
metabolic networks by focusing on EFM analysis because for most applications, the 
sets of EPs and EFMs are identical. However, the slight differences in the methods 
may have important consequences for specific applications (Klamt and Stelling, 
2003). 

EFM analysis can be used to identify all routes that enable a cell to convert a
certain substrate into a product. In EN1, for instance, four genetically independent
routes (EM1-4) produce P from A as sole substrate (figure 5.8). Purely internal
reaction cycles without net energy consumption, in contrast, would point to
thermodynamic inconsistencies (Beard et al., 2002). Since all possible steady-state
flux distributions are nonnegative linear combinations of EFMs, the pathway(s) that
are optimal or sub-optimal regarding the ratio of two reaction rates have to be among
these units. In the example network, the two EFMs with the highest P:A yield (R3/R1)
of one (EM1,2) correspond to the two qualitatively different optimal routes in
section 5.5.2 (figs. 5.5 and 5.6). However, FBA allows including additional
constraints such as maximal reaction capacities when searching for optimal flux
vectors. Pathway analysis uncovers all qualitatively different (potentially optimal)
pathways, the superposition of which gives the actual flux distribution observed in
vivo. Potential contributions of individual pathways to this flux distribution may be
analyzed through the spectrum of $\alpha$-values in equation 5.14 (Wiback et al., 2003).

As pathway analysis yields all possible routes, the importance of single reactions 
for the network behavior in a certain context can be analyzed. For instance, reaction 
R9 in EN1 is indispensable for the production of P from B alone, but several 
alternative routes without R9 exist for the conversion of A to P. Similarly, correlated 
reactions (see sections 5.4.2 and 5.5.2) can be dealt with. The number of reactions 
in a pathway might be of interest because it indicates the amount of cellular 
resources that is needed to establish the pathway, for instance, to provide for the 
necessary enzymes. Moreover, the distribution of pathway lengths can characterize 
the complexity of a given network or differences between seemingly similar networks 
in two organisms. 

The analysis of network functionality directly relates to the conservation
properties of EFMs. When a reaction is removed from a network, the new set of EFMs
contains all those EFMs of the original network in which the specific reaction does
not participate. An empty set for the perturbed network, hence, indicates that the
organism is structurally unable to achieve a steady-state flux distribution. This is
a reliable predictor of inviability for the corresponding gene deletion mutant
(Edwards et al., 2001a; Stelling et al., 2002). The concept of “minimal cut sets,” the
smallest sets of reactions whose inactivation guarantees network failure,
systematically extends these analyses; it allows one to search for optimal
intervention strategies (Klamt and Gilles, 2004). Likewise, the introduction of new
genes can be assessed.
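For very small pathway sets, minimal cut sets can be found by exhaustive search. The sketch below makes two assumptions that are not spelled out in the text: "network failure" is taken to mean that no EFM supporting a chosen target function survives, and the cut sets are obtained by brute-force enumeration of reaction subsets that hit every such EFM. The EFMs belong to a hypothetical network, not to EN1.

```python
from itertools import combinations

# Reaction sets of the EFMs that realize a target function (formation of P)
# in a hypothetical network; not taken from EN1.
target_modes = [
    {"R1", "R3"},          # A -> P directly
    {"R5", "R2", "R3"},    # B -> A -> P via the reversible reaction R2
]


def minimal_cut_sets(modes):
    """Smallest reaction sets whose removal eliminates every EFM in `modes`."""
    reactions = sorted(set().union(*modes))
    cut_sets = []
    for size in range(1, len(reactions) + 1):
        for candidate in combinations(reactions, size):
            candidate = set(candidate)
            hits_all = all(candidate & mode for mode in modes)
            # Sizes are visited in increasing order, so a candidate containing
            # an already found cut set cannot be minimal.
            is_minimal = not any(known <= candidate for known in cut_sets)
            if hits_all and is_minimal:
                cut_sets.append(candidate)
    return cut_sets


for cut_set in minimal_cut_sets(target_modes):
    print(sorted(cut_set))
```

The search visits all subsets of reactions and therefore scales exponentially; it is meant only to illustrate the definition.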

The analysis of network robustness/flexibility may be performed by assessing the
effects of all possible mutations. Here, however, it may be important to analyze the 
complete (reduced) sets of EFMs in order to investigate the effects of pathway 
redundancy, or the sensitivity of network performance in terms of yields upon 
perturbations. For instance, such an analysis would show that production of P 
from substrate A alone (four alternative pathways EM1-4) likely is less affected 
by random mutations than the production of P from exclusively B (one route 
EM8; figure 5.8). Hence, MPA represents a suitable approach for extracting a 
large number of structural features from a given network, but it is limited by the 
increasing combinatorial complexity in larger networks. 





5.7 Advanced Topics and Future Directions 


The most challenging fields in stoichiometric network analysis concern predomi- 
nantly (i) analyzing networks of increasing complexity, (ii) decomposing networks 
into modules and hierarchies, and (iii) incorporating and predicting cellular regu- 
lation (Price et al., 2004; Stelling, 2004). 


5.7.1 Genome-Scale Network Analysis 


The first task in genome-scale network analysis is network reconstruction from 
genomic, biochemical, and physiological data. Stoichiometry, directionality, and 
catalyzing enzymes (and their genes) of organism-specific metabolic (sub-)networks 
can now be obtained from databases such as KEGG (www.genome.ad.jp, Goto et al. 
(2002)) or MetaCyc (www.biocyc.org, Karp et al. (2002)). Unknown reactions and 
the necessary validation of database entries, however, pose challenges for model 
development. To date, genome-scale stoichiometric models have been established 
mainly for microbial model organisms. With up to ~1,200 reactions and ~700 
metabolites, they belong to the largest models of cellular systems known so far 
(Price et al., 2004). 

The analysis of such complex networks is straightforward for FBA, which requires
only linear optimization (Edwards et al., 2001a). Pathway analysis, however, has
to deal with a combinatorial explosion of possible routes with increasing network
complexity. EFM analysis in a model of E. coli central metabolism that comprised
only 89 metabolites and 110 reactions yielded up to half a million pathways (Stelling
et al., 2002) (see chapter 2). However, this number is far below the theoretical upper
bound of $4.39 \cdot 10^{21}$ for that network size, which is typical for highly structured
cellular networks (Klamt and Stelling, 2002). Despite the progress in pathway
computation (section 5.6.3), it still seems impossible to calculate all pathways
directly in genome-scale networks. Restricting the number of simultaneously active
system inputs (substrates) and outputs (product pathways) (Papin et al., 2002;
Price et al., 2002) allows one to describe many situations in practice, but it does
not provide an assessment of all organismic capabilities. Alternative approaches
aim at decomposing complex networks into biologically meaningful modules and
hierarchies, which is closely related to the study of general design principles in
cellular networks.
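The order of magnitude of the quoted upper bound can be checked with a few lines of code. The text does not state the formula; the sketch assumes a binomial bound of the form C(q, m + 1) for a network with q reactions and m metabolites, which reproduces the quoted figure for q = 110 and m = 89. The function name is illustrative.

```python
from math import comb


def efm_upper_bound(n_reactions, n_metabolites):
    """Assumed binomial upper bound C(q, m + 1) on the number of EFMs."""
    return comb(n_reactions, n_metabolites + 1)


bound = efm_upper_bound(110, 89)
observed = 500_000   # "up to half a million pathways" in the cited E. coli model
print(f"upper bound: {bound:.2e}")
print(f"observed / bound: {observed / bound:.1e}")
```

The gap of many orders of magnitude between the bound and the observed number of pathways illustrates how strongly network structure restricts the set of possible routes.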


5.7.2 Modularity and Hierarchies 


Modules are semi-autonomous entities that show dense internal functional connec- 
tions, but looser connections with their environment. They occur at all levels of 
cellular organization (and beyond), for instance, as metabolic pathways, or as an- 
abolism and catabolism at higher levels (see chapter 3). Modularity and hierarchies 
are directly linked because in general, smaller modules combine into larger modules 
of the next layer. Their biological relevance lies in the possibility to evolve, main- 
tain, and coordinate cellular functions effectively because changes in one module 
primarily affect this entity and do not (unintentionally) spread through the network 
(Lauffenburger, 2000; Oltvai and Barabasi, 2002). 

Graph theory has been the method of choice for uncovering modules and hi- 
erarchies in genome-scale networks in various organisms. For metabolic and tran- 
scriptional networks, several studies yielded a surprising overlap of the identified 
modules with “classical” biochemical entities, but also divergences (Ravasz et al., 
2002; Holme et al., 2003; Gagneur et al., 2003; Ihmels et al., 2004b). Consequently, 
formal approaches have been proposed for graph-based network decomposition and 
subsequent stoichiometric analysis (Schilling and Palsson, 2000; Schuster et al., 
2002a). For example, metabolites with higher connectivity numbers can be con- 
sidered as external to obtain “local” EFMs in small subnetworks (Schuster et al., 
2002b). Alternatively, one may consider subnetworks by neglecting reactions; the 
resulting EFMs will be valid for the complete network and, thus, approximate its 
capabilities. However, because graph-theoretical approaches use only little
biological knowledge and, consequently, represent reality only roughly, it would be
desirable to employ other structural approaches for this type of analysis. First
attempts in this direction rely on correlated reaction sets, which correspond to
enzyme subsets for perfectly coupled reactions (Papin et al., 2004a). Pathway analysis
could provide starting points for future methods because per se it aims at identifying
functional subunits in complex networks. In fact, EPs and EFMs may represent
overlapping modules. Sound theoretical criteria for the demarcation of modules from
pathway structures, however, still have to be developed.


5.7.3 Network Structure and Cellular Regulation


Analysis of cellular regulation in the SNA framework is at an early stage. Different 
objectives distinguish three broad classes of approaches. Starting from regulation 
as an additional constraint on network function, known regulatory interactions 
were incorporated into FBA (Covert et al., 2001) and into EP analysis (Covert 
and Palsson, 2003). Only those flux distributions were allowed that complied 
with regulatory rules superimposed onto the stoichiometric model. A simple, yet 
realistic control model uses logical rules, for instance “IF the favored substrate is
available THEN uptake of less preferred substrates is suppressed.” This “dynamic”
FBA improved predictions of mutant phenotypes of E. coli for a large variety of 
conditions (Covert et al., 2004). Even such coarse descriptions of regulation may 
serve as powerful constraints because they eliminate the majority of structurally 
possible pathways (Covert and Palsson, 2003). 
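How a logical rule acts as a constraint on structurally possible pathways can be sketched as follows. The pathways, the reaction names, and the substrate pair are hypothetical and are not taken from the cited studies; the rule is the IF/THEN example quoted above.

```python
# Hypothetical pathways, given as sets of active reactions, and a hypothetical
# logical rule; neither is taken from the cited studies.
pathways = {
    "glucose only": {"glc_uptake", "glycolysis"},
    "lactose only": {"lac_uptake", "lac_cleavage", "glycolysis"},
    "co-utilization": {"glc_uptake", "lac_uptake", "lac_cleavage", "glycolysis"},
}


def blocked_reactions(environment):
    """Translate the environment into the set of reactions that cannot carry flux."""
    blocked = set()
    if not environment["glucose"]:
        blocked.add("glc_uptake")        # nothing to take up
    if not environment["lactose"]:
        blocked.add("lac_uptake")        # nothing to take up
    # IF the favored substrate is available THEN uptake of the
    # less preferred substrate is suppressed.
    if environment["glucose"]:
        blocked.add("lac_uptake")
    return blocked


def permitted_pathways(environment):
    """Names of the pathways that use no blocked reaction."""
    blocked = blocked_reactions(environment)
    return sorted(name for name, reactions in pathways.items()
                  if not reactions & blocked)


print(permitted_pathways({"glucose": True, "lactose": True}))
print(permitted_pathways({"glucose": False, "lactose": True}))
```

Even this single rule removes two of the three structurally possible pathways in each environment, which mirrors the observation that coarse regulatory descriptions can act as strong constraints.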

A second class of approaches aims at inferring regulatory features from network 
structure. It assumes that evolution established regulatory circuits that are adapted 
to the network they control. Inference of regulation from network structure may be 
possible because the underlying regulatory logic could be relatively simple compared 
to the networks’ complexity (Lauffenburger, 2000). Enzyme subsets and correlated 
reactions help to qualitatively predict the relative control of fluxes. Singular value 
decomposition (SVD) of pathway matrices has been proposed to define the most 
important “eigenpathways” that could approximate the functionalities of a network 
and thereby unravel potential key control points. First evidence from human red 
blood cell metabolism supports this claim (Price et al., 2003). The pathway-based 
concept of “control-effective fluxes” allows one to estimate gene expression ratios under
different growth conditions solely from network structure (Stelling et al., 2002; Cakir 
et al., 2004). Analysis of E. coli central metabolism pointed to a different control 
logic of gene expression (long-term flexibility) versus regulation at the enzyme level 
(fine-tuning of fluxes in a specific situation) (Stelling et al., 2002). 

Finally, the direct application of SNA approaches to regulatory networks has just 
begun. Examples include the formulation of stoichiometric models for gene expres- 
sion (Allen et al., 2003). Extreme pathway analysis was extended to characterize 
information flows in signal transduction (Papin and Palsson, 2004) and in gene reg- 
ulatory networks (Xiong et al., 2004). Approaches like these may become important 
for elucidating crosstalk between signaling systems or for yielding functional cycles 
involved in signal propagation and resetting. It has to be noted, however, that reg- 
ulatory processes are often characterized by their dynamics, which cannot easily 
be captured by SNA methods. Here, as in the other fields described above, the 
establishment of new approaches and the testing of existing ones against biological 
data are necessary for the further development of stoichiometric network analysis.

5.8 Conclusions


Stoichiometric and constraint-based modeling provides powerful methods for the 
characterization of, in particular, metabolic networks. It can form a basis for more 
detailed dynamical modeling of such systems. However, none of the approaches 
we discussed is able to adequately address all the potential applications of SNA 
(table 5.2). 


Table 5.2 Approaches for stoichiometric network analysis, their requirements, and fields 
of application. Parentheses denote partial applicability.

Approach   Stoichiometry   Thermodynamics   Quasi steady state   Reaction capacities   Optimality   Computational costs   Flux distribution(s)

Graph theory  (—)@ (+) - - - Low None 

CRs + — 7 = — Low None 

Kernel matrix + = + = = Low All 

MFA + (+) (+) - Low Single 

FBA + + + + + Low Single 

MOMA + + H H Medium Single 

EFMs/EPs + + F = = High All 

Applications
Approach   Functional pathways   Optimal operation   Reaction importance   Reaction correlations   Pathway length   Network function   Robustness

Graph theory - = (+) ©) (+) (+) (+) 

CRs — = — = 7 = E 

Kernel matrix — - - (+) - (+) - 

MFA — = (+) = = = = 

FBA 5 + (+) = = a (+) 

MOMA F + (+) = = F (+) 

EFMs/EPs? + + + + 











a Graph-theoretical methods use only connectivities and, possibly, directions. 
b For the realistic case of equivalent sets of EPs and EFMs. 


Hence, the methods for tackling a specific problem have to be carefully selected. 
More specifically, FBA and related approaches are most suitable for finding partic- 
ular flux solutions even in genome-scale networks. Pathway analysis delivers a mul- 
titude of structural and functional aspects but is, in very large networks, hampered 
by combinatorial complexity. Despite such limitations, we expect the importance 
of SNA for systems biology to increase, particularly for an effective initial charac- 
terization of large-scale systems. We anticipate that the field will move towards a 
closer connection of the analysis of network structures in metabolism and regula- 
tion, which requires the development of new or modified theoretical methods. 


6 Modeling Molecular Interaction Networks 
with Nonlinear Ordinary Differential 
Equations 


Emery D. Conrad and John J. Tyson 


Cellular processes, like growth, division, motility, and death, are controlled by 
complex networks of interacting macromolecules (genes, mRNAs, and proteins). 
These networks are sets of chemical reactions that convert reactant species into 
product species at rates that depend on reactant concentrations and, often, on 
the concentrations of other molecules (enzymes, inhibitors, transcription factors). 
To a first approximation, a reaction network can be described mathematically 
by a set of nonlinear ordinary differential equations that track the effects of 
these simultaneously occurring reactions. To gain some insight into the dynamical 
possibilities of such networks, we explore a set of increasingly more complicated 
network motifs, describing their effects in terms of signal-response curves. From 
our collection of simple functional motifs (buzzers, fuses, toggle switches, and a 
variety of oscillators) we can create realistic models of control systems actually 
employed by cells. As an example, we discuss the DNA-damage response pathway 
in mammalian cells. 





6.1 Introduction 


Molecular biologists often rely on suggestive cartoons to capture the complex 
interactions between many molecular components in functional networks of genes, 
proteins, and metabolites. In such cartoons (for example, figure 6.1), icons represent 
the interacting molecules and solid arrows their chemical transformations, for 
example, synthesis, degradation, phosphorylation, dephosphorylation, binding, and 
dissociation. Enzymatic and other indirect effects (such as allosteric activations 
or inhibition) are often represented by dashed arrows. These cartoons (or “wiring 
diagrams”) are useful in summarizing many experimental observations, in capturing
the way biologists think about molecular mechanisms, and in suggesting new
experiments to test or extend this molecular understanding of cell physiology.

Figure 6.1 Example wiring diagram (reproduced from (Ciliberto et al., 2005), with
permission). Intracellular proteins (p53, Mdm2, etc.) participate in a network of chemical 
reactions (solid arrows), such as synthesis, degradation, phosphorylation, and ubiquitina- 
tion. Dashed arrows represent catalytic or regulatory effects on a reaction. This wiring 
diagram is a hypothetical mechanism (Ciliberto et al., 2005) for the interactions between
p53, a transcriptional activator involved in cell cycle arrest and apoptosis, and Mdm2, a 
protein involved in degradation of p53. Mdm2 catalyzes the ubiquitination of p53, and 
polyubiquitinated p53 is rapidly degraded. Two feedback signals govern the behavior of 
the reaction network: (1) p53 stimulates the synthesis of Mdm2 in the cytoplasm, and 
(2) p53 indirectly inhibits the transport of Mdm2 into the nucleus. In response to DNA 
damage, the degradation of Mdm2 in the nucleus is upregulated. (IR = ionizing radiation) 


Although most cell biologists use molecular wiring diagrams in these informal 
ways, we would like to pursue the idea that a reaction network is fundamentally 
a complex dynamical system and that its wiring diagram instructs how the con- 
centrations of all the interacting components will change over time as the chemical 
reactions play out within the cell. From this point of view, the next question is 
how best to capture the dynamics of the network in mathematical form, in order 
to analyze and simulate its behaviors and ultimately to use the model to answer 
real physiological questions. For the purposes of this chapter, we will use nonlin- 
ear ordinary differential equations (ODEs) to represent the dynamical properties of 
reaction networks. 

Realizing a reaction network as a system of ODEs is based on two assumptions. 
First, that our system is a “well-stirred” chemical reactor, so that component 
concentrations don’t vary with respect to space. This is a reasonable assumption 
for cell-free extracts, but it hardly seems appropriate for an intact cell. Whether 
it is a good approximation or not depends on the time and space scales involved. 


In box 6.1, we show that molecular diffusion is sufficiently fast to mix proteins
throughout a yeast-sized cell in less than a minute. If we are interested in cell cycle 
processes (time scale = hours) or circadian rhythms (period = 1 day), then the 
“well-stirred” assumption is justified. If we are interested in membrane oscillations 
(time scale = seconds), then the “well-stirred” assumption would not be advisable. 

When spatial information is required, then partial differential equations (PDEs) 
would be indicated. We will not discuss modeling by PDEs in this chapter, but note 
that, before one can appreciate the special properties of nonlinear PDE models (see 
chapter 9 of this book and Murray (2002b)), one must first master the principles 
in this chapter. 








Box 6.1: How fast is molecular diffusion? 

Given the typical diameter of a cell to be $L = 10^{-3}$ cm and a typical diffusion constant for
a protein in aqueous solution to be $D = 10^{-7}\ \mathrm{cm^2/s}$, we can calculate the average time
for a protein to diffuse across a cell to be $t = L^2/(2D) = 5$ s. If diffusion is 10-fold
slower in cytoplasm, then the average time to cross a cell is roughly 1 min. These are
expected “mixing times” for macromolecules in cells. Metabolites (small molecules)
will mix on a faster time scale.
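The arithmetic of box 6.1 is easy to reproduce. The sketch below assumes the one-dimensional estimate $t = L^2/(2D)$ with a cell diameter of $10^{-3}$ cm and $D = 10^{-7}\ \mathrm{cm^2/s}$; these inputs are an assumption chosen because they reproduce the 5 s quoted in the box, and the function name is purely illustrative.

```python
# Rough mixing-time estimate, assuming t = L**2 / (2 * D) (one-dimensional diffusion).
CELL_DIAMETER_CM = 1e-3   # assumed cell diameter (10 micrometers)
D_AQUEOUS = 1e-7          # assumed protein diffusion constant in cm^2/s


def diffusion_time(length_cm, diff_const):
    """Average time in seconds to diffuse a distance length_cm."""
    return length_cm ** 2 / (2.0 * diff_const)


t_aqueous = diffusion_time(CELL_DIAMETER_CM, D_AQUEOUS)           # about 5 s
t_cytoplasm = diffusion_time(CELL_DIAMETER_CM, D_AQUEOUS / 10.0)  # about 50 s
```

Because the time scales with the square of the distance, a cell ten times larger would need roughly a hundred times longer to mix by diffusion alone.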





The second basic assumption is that the variables (chemical concentrations) are 
continuous functions of time. This assumption is valid if the number of molecules 
of each species in the reaction volume (the cell or subcellular compartment) is
sufficiently large (say, thousands of molecules each, at least). For concentrations 
greater than about 10 nM, we are safe using ODEs (see box 6.2). 








Box 6.2: How many molecules of a regulatory protein in a cell? 

A spherical cell of diameter $10^{-5}$ m has a volume of roughly $0.5 \times 10^{-15}\ \mathrm{m^3} = 5 \times
10^{-13}$ L. Given a typical concentration of a specific regulatory protein to be 10 nM,
we calculate $10^{-8}\ \mathrm{mol/L} \times 6 \times 10^{23}\ \mathrm{molecules/mol} \times 5 \times 10^{-13}\ \mathrm{L} = 3{,}000$ molecules. For a
reaction volume containing 3,000 molecules, we are justified in using ordinary differ-
ential equations to describe changes in a continuous variable $X(t)$ = concentration of
species X. Were the concentration to drop below 1 nM, we would need to reformulate
the model in terms of stochastic variables to capture the effects of molecular noise in
the dynamical system.
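The estimate in box 6.2 can be checked with a few lines of Python. This is only a sketch: the function names are illustrative, and the cell is treated as a perfect sphere of diameter $10^{-5}$ m, as in the box.

```python
import math

AVOGADRO = 6.022e23  # molecules per mole


def cell_volume_liters(diameter_m):
    """Volume of a sphere of the given diameter, converted from m^3 to liters."""
    volume_m3 = math.pi / 6.0 * diameter_m ** 3
    return volume_m3 * 1000.0  # 1 m^3 = 1000 L


def molecule_count(conc_molar, volume_liters):
    """Number of molecules at a given molar concentration in a given volume."""
    return conc_molar * AVOGADRO * volume_liters


volume = cell_volume_liters(1e-5)        # roughly 5e-13 L
n_10nM = molecule_count(10e-9, volume)   # roughly 3,000 molecules
n_1nM = molecule_count(1e-9, volume)     # roughly 300 molecules
```

At 1 nM only a few hundred molecules remain in the cell, which is the regime where the box recommends a stochastic description.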








If the total number of molecules of any particular substance, say, a transcription 
factor, is less than 1,000, then a stochastic differential equation or a Monte Carlo 
model would be more appropriate (Rao et al., 2002; McAdams and Arkin, 1999). 
Stochastic modeling is much more difficult than ODEs and requires a preliminary 
understanding of the deterministic dynamical system. For this reason, it makes sense 
to limit this chapter to ODE modeling and leave the harder stuff to chapter 8. 

Granted these two simplifying assumptions, ordinary differential equations
are a very useful language in which to express mathematically the dynamical
consequences of a molecular interaction network. By applying a set of simple
rules, we can express an arbitrarily complex reaction network as a set of coupled 
differential equations. The computer can then keep track of all the complex, 
interweaving interactions in the network and tell us with great precision what 
are the consequences of the mechanism that purports to describe some aspect of 
cell physiology. In this sense, kinetic modeling by differential equations is a tool 
for hypothesis testing (see chapter 1). If the mathematical consequences of the 
mechanism do not agree with observations, we must search for the problems in our 
hypotheses. If the consequences agree with the observations, then we can have some 
confidence that we are on the right track to understanding the mechanism. 

We assume the reader has no familiarity with how to do kinetic modeling 
of chemical reactions beyond some vague (and possibly regretful) memories of 
the Michaelis-Menten equation. We start with the basic idea of using a simple 
rate law to describe how fast a chemical reaction proceeds and show how to 
estimate kinetic rate constants for isolated reactions from data. Then we assemble 
a few simple reactions (for protein synthesis, degradation, phosphorylation, and 
dephosphorylation) into modules for chemical buzzers, switches, and oscillators. 
These reaction motifs can then be linked together to form more complicated and 
realistic control systems. Writing the differential equations describing these systems 
can be largely automated, and solving the equations can be fully automated (see 
chapter 16). Fitting the results to experimental data and estimating rate constants 
are difficult tasks, which are the subjects of active research (chapter 11). We shall 
touch on all these issues in what follows. 





6.2 Basic Building Blocks 
6.2.1 From a Wiring Diagram to a Set of ODEs 


To get from a wiring diagram to a set of ODEs, we must think about a network as 
a dynamical system whose state is changing from one moment of time to the next. 
We assign to each species (or icon) in the diagram a single state variable, X(t) 
= the concentration of species X. The collection of values of all these variables 
{X1(t), X2(t), X3(t),...} at any point in time constitutes the state of the system. 
Then, for each molecular species, we write a differential equation that describes how 
its concentration changes over time due to its interactions with the other species in 
the network. For example, for species X, we write 


$$\frac{dX}{dt} = \text{synthesis} - \text{degradation} - \text{phosphorylation} + \text{dephosphorylation} - \text{binding} + \text{release}, \text{ etc.} \tag{6.1}$$

The rate of each reaction (synthesis, degradation, etc.) must be represented by a
kinetic rate law, which will have one or more rate constants associated with it. By
assigning specific values to these rate constants, we fine-tune general rate laws to
particular reactions. The set of all rate constants needed to describe the reactions
in a molecular interaction network is called the parameter set $\{p_1, p_2, \ldots, p_m\}$ of the
model.
In this paradigm, the dynamical consequences of a reaction network are deter- 
mined by a system of nonlinear ordinary differential equations, 

$$\frac{dX_i}{dt} = F_i(X_1, X_2, \ldots, X_n;\ p_1, p_2, \ldots, p_m), \qquad i = 1, 2, \ldots, n. \tag{6.2}$$


The ODEs are nonlinear because the rate laws on the right-hand sides of equation 6.2
are often nonlinear functions of the state variables. Notice that the ODEs
tell us how each state variable is changing with respect to time; they do not tell
us the value of $X_i$ at any specific time $t$. To solve the differential equations is to
find these functions, $X_i(t)$, for each species ($i$) in the network. Each function
corresponds to a measurable property of the system, the time course of species $i$. In
order to solve equation 6.2 for the time courses $X_i(t)$, we must first prescribe a set
of initial conditions $\{X_1(0), X_2(0), \ldots, X_n(0)\}$. The combination of rate equations,
initial conditions, and parameter values is called a well-posed initial value problem
(IVP), and its solution is guaranteed by a famous theorem stated informally in
box 6.3.





Box 6.3: Existence and uniqueness theorem 

Given very weak conditions on the smoothness of the rate laws on the right-hand side
of equation 6.2, conditions that are usually satisfied by realistic models of reaction
networks, the initial value problem has one and only one solution $\{X_1(t), X_2(t), \ldots, X_n(t)\}$
for all $0 < t < \infty$. By running time backwards, we can also find a unique prehistory
of the system (for $-\infty < t < 0$).














Box 6.4: Linear and nonlinear differential equations 

If the $F_i$’s in equation 6.2 are linear functions of the variables, $X_1, X_2, \ldots, X_n$, then
much can be said about the dynamical characteristics of the reaction system. The
good news is that the solution can be expressed analytically in terms of exponential
functions, $\exp(\lambda_i t)$, and harmonic functions, $\sin(\omega_i t + \phi_i)$. The bad news is that the
dynamical possibilities of a linear system are very impoverished. In general, there can
be only a single steady state solution, and all other solutions either approach this
steady state as $t \to \infty$ or they blow up (some $X_i \to \infty$ as $t \to \infty$). Linear systems
show none of the interesting dynamical behaviors (multiple steady states, limit cycle
oscillations) to be described later in this chapter. The interesting dynamical features
depend crucially on nonlinear dependencies of the $F_i$’s on the $X_i$’s.











We can imagine three types of “solutions” of a system of ODEs. 


1. Analytical. Under very special circumstances (see, for example, box 6.4), it is
possible to write the solution of a set of ODEs in terms of elementary functions,
such as $X_1(t) = X_1(0)e^{-k_a t}$, $X_2(t) = X_2(0)[1 + 0.5\sin(k_b t)]$, $X_3(t) = \ldots$, where
$k_a, k_b, \ldots$ are rate constants and $X_1(0), X_2(0), \ldots$ are initial values.

2. Numerical. It is always possible to solve a well-posed IVP numerically on a 
computer. In principle, we can write 


$$X_i(t + \Delta t) = X_i(t) + F_i(X_1(t), X_2(t), \ldots, X_n(t)) \cdot \Delta t \tag{6.3}$$


for each $i$. By starting at $X_i(0)$ and taking sufficiently small steps, $\Delta t$, we can “walk
along” the time course to any time $t$ in the future (or in the past). In practice, there
are much more sophisticated, efficient, and accurate numerical schemes for walking
along the time course (see chapter 16).


3. Qualitative. Whereas numerical integration of the ODEs gives us quantitative
information about the solution (which is necessary if we are trying to account for
quantitative experimental data), sometimes we are more interested in answers to
qualitative questions, like, “What will the network do if I wait for a sufficiently long
time?” (that is, characterize the solutions, the “stable attractors,” of the ODEs as
$t \to \infty$) or “How will the long-term behavior of the network change if I double the
rate of synthesis of protein X?” (that is, characterize the dependence of the stable
attractors on any parameter in the ODEs).
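As a minimal sketch of the numerical approach in item 2, the forward Euler rule of equation 6.3 can be written in a few lines of Python. The function name `euler`, the test system (a single species decaying at 0.07 per minute), and the step size are illustrative assumptions; the simulation tools mentioned in this chapter use more accurate schemes.

```python
import math


def euler(rhs, x0, t_end, dt):
    """Integrate dX/dt = rhs(X) from t = 0 to t_end with fixed step dt.

    x0 is a list of initial values; rhs takes a list and returns a list
    of the same length (the right-hand sides F_i of equation 6.2).
    """
    x = list(x0)
    for _ in range(int(round(t_end / dt))):
        f = rhs(x)
        # Equation 6.3: X_i(t + dt) = X_i(t) + F_i(X(t)) * dt
        x = [xi + fi * dt for xi, fi in zip(x, f)]
    return x


# Hypothetical test system: dX/dt = -k * X, whose exact solution is known.
k = 0.07
x_num = euler(lambda x: [-k * x[0]], [100.0], t_end=10.0, dt=0.001)[0]
x_exact = 100.0 * math.exp(-k * 10.0)
```

With this step size the Euler result agrees with the exact exponential to within about 0.01; halving the step roughly halves the error, which is the signature of a first-order method.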


To explore the examples that we will present, we suggest that the reader download 
XPPAUT or one of the other tools for simulating dynamical systems listed in the 
appendix. 

Many of our qualitative methods depend on identifying and characterizing the
steady state solutions of equation 6.2. A steady state solution is a set of constants
$\{X_1^*, X_2^*, \ldots, X_n^*\}$ for which the net rate of change of every variable is zero, that
is, $F_i(X_1^*, X_2^*, \ldots, X_n^*) = 0$ for all $i = 1, 2, \ldots, n$. A steady state is a special
time-invariant solution of the ODEs, where the reactions producing and consuming each
species perfectly cancel each other. Steady states can be either stable or unstable.
Stable steady states attract all nearby solutions, whereas unstable steady states
repel some nearby solutions as time increases.


6.2.2 Constant Synthesis 


For starters, let’s consider a constant rate of synthesis of some macromolecule, which
can be described by the initial value problem $dX/dt = k_1$, $X(0) = X_0$. In this case, the
differential equation is simple enough that we can guess the solution of the initial
value problem: $X(t) = X_0 + k_1 t$. The numerical value of the rate constant must be
estimated from experimental data. For example, from observations of accumulating
cyclin in a frog egg extract (figure 6.2), we estimate that $k_1 = 1$ nM/min.

$X(t) = X_0 + k_1 t$ is an example of an explicit, analytical solution. The uniqueness
part of the theorem in box 6.3 assures us that once we have guessed a solution to 
the initial value problem, it is the only solution. We can sleep soundly at night, 
assured that we have not overlooked some other solution of this dynamical system.

Figure 6.2 Experimental data used to estimate kinetic rate constants. a) Accumulation
of cyclin (filled circles) in a frog egg extract; degradation of cyclin in interphase cells 
(open squares; (Felix et al., 1990)) and in metaphase cells (filled squares, from (Tang 
et al., 1993)). b) Formation of dimers of Cdk1 and cyclin B in an extract for which the 
initial concentration of Cdk1 monomers was approximately 100nM (Kumagai and Dunphy, 
1995). 


Once we have the solution, we can ask, “What happens to $X(t)$ as $t \to \infty$?”
Well, it appears that the concentration of X grows without bound. We get this 
undesirable result because there is no term to counteract the growth rate in the 
differential equation. 


6.2.3 Linear Degradation 


Biochemical molecules naturally experience decay or degradation, and the rate at
which this happens depends on how much of the molecule is present. In mathematical
terms, $dX/dt = -k_2 X$, $X(0) = X_0$. The unique solution to this initial
value problem is $X(t) = X_0 e^{-k_2 t}$. An interesting property of exponential decay
is that $X$ disappears with a constant half-life, $t_{1/2}$, defined by $X(t_{1/2}) = \frac{1}{2} X_0$.
For linear degradation, $t_{1/2} = \ln 2 / k_2$. From the data on cyclin degradation in
figure 6.2, we see that cyclin is disappearing with a half-life of about 10 minutes,
hence $k_2 \approx 0.07\ \mathrm{min^{-1}}$.

At this point, the reader should consider what happens when we combine a
constant rate of synthesis with linear degradation. That is, what is the analytical
solution of the initial value problem: $dX/dt = k_1 - k_2 X$, $X(0) = X_0$? From the exact
solution, show that $X(t) \to k_1/k_2$ as $t \to \infty$, for any $X_0 > 0$.
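Before working out the exact solution, the claimed limit can be checked numerically. The sketch below integrates the equation with small forward Euler steps from several initial values. Pairing the synthesis estimate of 1 nM/min with the degradation estimate of 0.07 per minute in a single model is an assumption made only for illustration.

```python
def simulate(k1, k2, x0, t_end=200.0, dt=0.01):
    """Forward Euler integration of dX/dt = k1 - k2 * X starting from X(0) = x0."""
    x = x0
    for _ in range(int(round(t_end / dt))):
        x += (k1 - k2 * x) * dt
    return x


K1 = 1.0    # nM/min, assumed synthesis rate
K2 = 0.07   # 1/min, assumed degradation rate constant

# Start below, near, and above the expected limit.
finals = [simulate(K1, K2, x0) for x0 in (0.0, 5.0, 50.0)]
target = K1 / K2   # about 14.3 nM
```

All three runs end within a small fraction of a nanomolar of $k_1/k_2$, whether they start below or above it, which is what the exercise asks the reader to prove.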


6.2.4 Autocatalytic Production 


Autocatalysis is a process whereby a molecule activates its own production, either
directly or indirectly through intermediates. In molecular biology, important
examples include DNA synthesis and ribosome biogenesis. The simplest equation
expressing autocatalysis is $dX/dt = k_1 X$. This is identical to the equation of the
previous subsection, except for a difference of sign. The solution is $X(t) = X_0 e^{k_1 t}$.
In this case, the solution grows with a constant doubling time, $t_d = \ln 2 / k_1$. We’ll see
more complex, indirect autocatalytic effects when we discuss feedback, later in the
chapter.


6.2.5 Dimerization 


Another fundamental reaction in biochemical networks is dimerization, where two
species combine to form a complex. Examples include enzymes binding substrates,
and the successive steps in the formation of hemoglobin, a four subunit heteromer
($\alpha_2\beta_2$). According to the Law of Mass Action, dimerization proceeds at a rate
proportional to the product of the concentrations of the two binding species. Hence,
we can express C binding X, forming the complex M, by the following scheme


reaction                          C       +   X       →   M
initial concentrations            C_0         X_0         0
extent of reaction                −M          −M          M
concentrations at a later time    C_0 − M     X_0 − M     M

$$\frac{dM}{dt} = k_3 C X = k_3 (C_0 - M)(X_0 - M), \qquad [k_3] = \frac{1}{\mathrm{nM \cdot min}} \tag{6.4}$$


where we’ve chosen to write $C(t)$ and $X(t)$ in terms of $M(t)$ so that we have a
single, solvable equation for the unknown function $M(t)$. The notation $[k_3]$ means
“the units of $k_3$.”


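Now, guessing a solution to this equation requires a bit more imagination. Let’s
suppose that we receive a mysterious letter claiming that

$$M(t) = \frac{C_0 X_0 (1 - e^{-\alpha t})}{C_0 - X_0 e^{-\alpha t}},$$

where $\alpha = k_3(C_0 - X_0)$, solves the initial value problem, when $M(0) = 0$, as in
the scheme above. We can verify this claim by differentiating and doing a bit of
algebra:

$$\frac{d}{dt}\,\frac{C_0 X_0 (1 - e^{-\alpha t})}{C_0 - X_0 e^{-\alpha t}} = \frac{\alpha C_0 X_0 (C_0 - X_0) e^{-\alpha t}}{(C_0 - X_0 e^{-\alpha t})^2} \tag{6.5}$$

and

$$k_3 (C_0 - M)(X_0 - M) = k_3\, C_0 \left(1 - \frac{X_0 (1 - e^{-\alpha t})}{C_0 - X_0 e^{-\alpha t}}\right) X_0 \left(1 - \frac{C_0 (1 - e^{-\alpha t})}{C_0 - X_0 e^{-\alpha t}}\right) = \frac{\alpha C_0 X_0 (C_0 - X_0) e^{-\alpha t}}{(C_0 - X_0 e^{-\alpha t})^2}. \tag{6.6}$$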





Remember that once we have a solution (even if it comes in the mail), it is the only 
solution we ever need (thanks to the existence and uniqueness theorem in box 6.3). 

Notice from the analytical solution, $M(t) = \frac{C_0 X_0 (1 - e^{-\alpha t})}{C_0 - X_0 e^{-\alpha t}}$ where $\alpha = k_3(C_0 - X_0)$,
that, if $C_0 > X_0$, then $\alpha > 0$ and $M(t) \to X_0$ as $t \to \infty$. On the other
hand, if $C_0 < X_0$, then $\alpha < 0$ and $M(t) \to C_0$ as $t \to \infty$. In either case, the
asymptotic concentration of the complex is the initial concentration of the subunit
in short supply. In principle, this conclusion is incorrect, because we have neglected
dissociation of the complex ($M \to C + X$, with some rate constant $k_{-3}$).
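The closed-form solution and its two limits can also be checked numerically. The sketch below compares $M(t) = C_0 X_0 (1 - e^{-\alpha t})/(C_0 - X_0 e^{-\alpha t})$, with $\alpha = k_3(C_0 - X_0)$, against a direct forward Euler integration of equation 6.4. The parameter values and function names are hypothetical, chosen only for illustration.

```python
import math


def m_exact(t, c0, x0, k3):
    """Closed-form M(t) for dimerization without dissociation (requires c0 != x0)."""
    alpha = k3 * (c0 - x0)
    e = math.exp(-alpha * t)
    return c0 * x0 * (1.0 - e) / (c0 - x0 * e)


def m_euler(t_end, c0, x0, k3, dt=0.001):
    """Forward Euler integration of dM/dt = k3 * (c0 - M) * (x0 - M), M(0) = 0."""
    m = 0.0
    for _ in range(int(round(t_end / dt))):
        m += k3 * (c0 - m) * (x0 - m) * dt
    return m


C0, X0, K3 = 100.0, 40.0, 0.003   # hypothetical: nM, nM, 1/(nM * min)

m_formula = m_exact(10.0, C0, X0, K3)
m_numeric = m_euler(10.0, C0, X0, K3)
limit_x_scarce = m_exact(1000.0, C0, X0, K3)   # C0 > X0: tends to X0
limit_c_scarce = m_exact(1000.0, X0, C0, K3)   # roles swapped: tends to the smaller value
```

Swapping the two initial concentrations changes the sign of $\alpha$ but not the limit, which is always the initial concentration of the scarcer partner.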

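In order to estimate the rate constant from the data in figure 6.2b, we notice that
$C_0 = 100$ nM and $X_0 \approx 40$ nM (why?). Considering that it takes about 3 minutes
for $M(t)$ to reach 20 nM, we can solve $M(3) = 20$ for $\alpha$:

$$M(3) = \frac{4000 (1 - e^{-3\alpha})}{100 - 40 e^{-3\alpha}} = 20 \tag{6.7}$$

$$\Rightarrow\ 200 - 200 e^{-3\alpha} = 100 - 40 e^{-3\alpha} \tag{6.8}$$

$$\Rightarrow\ e^{-3\alpha} = \frac{5}{8}, \qquad \alpha = \frac{1}{3} \ln \frac{8}{5}\ \mathrm{min^{-1}} \tag{6.9}$$

Therefore, we estimate that $k_3 = \frac{1}{180} \ln \frac{8}{5}\ \mathrm{nM^{-1}\,min^{-1}} = 2.6 \times 10^{-3}\ \mathrm{nM^{-1}\,min^{-1}}$.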
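The same estimate can be reproduced in a few lines of Python. The sketch solves the closed-form dimerization solution for $e^{-\alpha t}$ at the observed time point and then converts $\alpha$ into $k_3$ using $\alpha = k_3 (C_0 - X_0)$. The variable names are illustrative, and the data values are those read from figure 6.2b in the text.

```python
import math

C0, X0 = 100.0, 40.0        # nM, initial concentrations
T_OBS, M_OBS = 3.0, 20.0    # min and nM: M reaches about 20 nM after about 3 min

# Rearranging M = C0*X0*(1 - e) / (C0 - X0*e) for e = exp(-alpha * T_OBS):
e_obs = C0 * (X0 - M_OBS) / (X0 * (C0 - M_OBS))   # equals 5/8 for these data
alpha = -math.log(e_obs) / T_OBS                   # per minute
k3 = alpha / (C0 - X0)                             # per nM per minute

# Consistency check: plug alpha back into the closed-form solution.
e_back = math.exp(-alpha * T_OBS)
m_back = C0 * X0 * (1.0 - e_back) / (C0 - X0 * e_back)
```

Because only one data point is used, the estimate inherits whatever error there is in reading "about 3 minutes" and "about 20 nM" from the figure.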
6.2.6 Michaelis-Menten Kinetics 


The diagram in figure 6.3a represents the enzymatic transformation of substrate X
into product P. Michaelis and Menten (1913) and Briggs and Haldane (1925) first
explored the elementary reaction mechanism (figure 6.3b) for this process. Assuming
that the total enzyme concentration $E_T$ is much less than the initial substrate
concentration, $X_0$, they showed that the rate of the enzyme-catalyzed reaction can
be written as: $\frac{dP}{dt} = -\frac{dX}{dt} = \frac{k_2 E_T X}{K_m + X}$, where $K_m = \frac{k_{-1} + k_2}{k_1}$ is called the Michaelis
constant. Here $k_1$ and $k_{-1}$ denote the rate constants for formation and dissociation
of the enzyme-substrate complex, and $k_2$ the rate constant for the catalytic step.
Note that $[K_m]$ = nM. A rigorous derivation of the Michaelis-Menten rate
law can be found in (Murray, 2002a), and in (Segel, 1988).








a) X → P, catalyzed by E        b) X + E ⇌ C → P + E


Figure 6.3 Michaelis-Menten kinetics. a) Enzyme E catalyzes the conversion of sub- 
strate X into product P. b) Michaelis-Menten mechanism for an enzyme-catalyzed reac- 
tion: E binds the substrate X to form a complex C; in the complex, E converts X to P; 
once the conversion is done, E dissociates from P and is free to bind another molecule of 
substrate. 
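To see when the Michaelis-Menten rate law is a reasonable stand-in for the full mechanism of figure 6.3b, one can integrate both and compare the amount of product formed. The sketch below is illustrative only: the names `k1` (binding), `k_minus1` (dissociation), and `k2` (catalysis) are assumed labels for the three elementary steps, $K_m = (k_{-1} + k_2)/k_1$ is the standard Briggs-Haldane definition, and all numerical values are hypothetical, chosen so that total enzyme is far below the initial substrate concentration.

```python
def full_mechanism(x0, e_total, k1, k_minus1, k2, t_end, dt):
    """Forward Euler integration of X + E <-> C -> P + E by mass action."""
    x, c, p = x0, 0.0, 0.0
    for _ in range(int(round(t_end / dt))):
        e_free = e_total - c
        bind = k1 * x * e_free      # X + E -> C
        unbind = k_minus1 * c       # C -> X + E
        convert = k2 * c            # C -> P + E
        x += (unbind - bind) * dt
        c += (bind - unbind - convert) * dt
        p += convert * dt
    return p


def michaelis_menten(x0, e_total, k1, k_minus1, k2, t_end, dt):
    """Forward Euler integration of dX/dt = -k2 * E_T * X / (K_m + X)."""
    k_m = (k_minus1 + k2) / k1
    x = x0
    for _ in range(int(round(t_end / dt))):
        x -= k2 * e_total * x / (k_m + x) * dt
    return x0 - x   # product formed


# Hypothetical parameters: E_T = 1 nM is much less than X0 = 1000 nM.
ARGS = dict(x0=1000.0, e_total=1.0, k1=1.0, k_minus1=50.0, k2=50.0,
            t_end=10.0, dt=1e-4)
p_full = full_mechanism(**ARGS)
p_mm = michaelis_menten(**ARGS)
```

With these values the two product curves differ by well under two percent after 10 minutes. Raising `e_total` toward `x0` makes the discrepancy grow, which is the practical meaning of the assumption behind the rate law.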


Among other things, the Michaelis-Menten rate law can be used to reduce the 
number of variables which describe a typical enzymatic conversion process, such as 
phosphorylation or dephosphorylation. This reduction is often useful when trying to 
understand the dynamic possibilities of a network using analytical and qualitative 
methods. On the other hand, one must keep in mind the assumption ($E_T \ll X_0$)
so that the rate law is applied in a consistent fashion.

6.3 Simple Networks and Signal-Response Curves


The basic rate laws just described can be combined to form reaction motifs that are 
commonly found in biochemical networks. These motifs have specific characteristics 
that dominate their behavior within larger networks. In order to build some 
dynamical intuition that may be useful in understanding larger, more realistic 
macromolecular networks, we first explore the properties of some common network 
motifs. 


6.3.1 Synthesis and Degradation 


Our first motif is simultaneous synthesis and degradation (figure 6.4a), described by
$dX/dt = k_1 S - k_2 X$, with $X(0) = 0$. In this equation, we might think of $S$ (“signal”) as
the concentration of mRNA encoding protein X. Notice that $[k_1] = [k_2] = \mathrm{min^{-1}}$.
The solution of this ODE, which was posed as a problem earlier in the chapter, is
$X(t) = \frac{k_1 S}{k_2} (1 - e^{-k_2 t})$. Notice that as $t \to \infty$, $e^{-k_2 t} \to 0$, and our solution tends
towards the value $X_{ss} = k_1 S / k_2$. Notice also that $k_1 S - k_2 X_{ss} = 0$, so $X_{ss}$ is the
steady state solution of the differential equation, as described earlier.


a) S → X →        b) X plotted against S: a straight line with slope = $k_1/k_2$


Figure 6.4 A signal-response relationship. a) Signal S stimulates the synthesis of protein 
X. b) Linear response of steady state protein concentration to signal strength. 


If we think of S as an input signal (mRNA concentration) and X as the response 
(protein concentration), then this motif at steady state generates a linear signal- 
response curve, as depicted in figure 6.4b. 


6.3.2 Phosphorylation and Dephosphorylation 


Now suppose X is phosphorylated and dephosphorylated as depicted in figure 6.5a.
Choosing to model both the forward and reverse steps using simple linear kinetics,
we write $\frac{dX_P}{dt} = k_1 S (X_T - X_P) - k_2 X_P$, where $X_T$ is the total concentration of
both phosphorylated and unphosphorylated forms of X (so that $X_T - X_P = X$),
and $S$ is the concentration of the protein kinase. (The concentration of the protein
phosphatase is absorbed into the value of $k_2$.) Notice that $[k_1] = \mathrm{nM^{-1}\,min^{-1}}$,
$[k_2] = \mathrm{min^{-1}}$. Solving $\frac{dX_P}{dt} = 0$ results in a single steady state solution,
$X_{P,ss} = \frac{X_T S}{(k_2/k_1) + S}$, which is a hyperbolic function of $S$ (see figure 6.5b). This is called a
hyperbolic signal-response curve.

Figure 6.5 Hyperbolic signal-response curve (see text).


We can determine the stability of the steady state graphically by plotting $\frac{dX_P}{dt}$
as a function of $X_P$. Noting that trajectories lie along the x-axis, we see that for
$\frac{dX_P}{dt} > 0$ (that is, wherever the curve is above the x-axis), the solution, $X_P(t)$,
moves to the right along the x-axis and for $\frac{dX_P}{dt} < 0$ (where the curve is below the
x-axis), the solution moves to the left. The curve crosses the x-axis at $X_{P,ss}$, the
steady state. The stability of the steady state is then obvious because $X_P(t)$ moves
towards $X_{P,ss}$ along the x-axis (figure 6.5c). This method of determining stability
can be applied to any single-variable system.
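The graphical argument amounts to checking the sign of the rate on either side of the steady state, which is easy to do numerically. The sketch below uses the linear-kinetics model $dX_P/dt = k_1 S (X_T - X_P) - k_2 X_P$ and its steady state $X_T S / (k_2/k_1 + S)$. The parameter values and function names are hypothetical.

```python
def dxp_dt(xp, s, k1, k2, xt):
    """Rate of change of phosphorylated X under linear kinetics."""
    return k1 * s * (xt - xp) - k2 * xp


def xp_steady(s, k1, k2, xt):
    """Hyperbolic steady state: X_T * S / (k2/k1 + S)."""
    return xt * s / (k2 / k1 + s)


K1, K2, XT = 0.01, 1.0, 100.0   # hypothetical: 1/(nM*min), 1/min, nM
S = 100.0                        # kinase concentration, nM (equal to k2/k1 here)

xp_ss = xp_steady(S, K1, K2, XT)                 # half of X_T when S = k2/k1
rate_below = dxp_dt(0.8 * xp_ss, S, K1, K2, XT)  # positive: X_P increases
rate_above = dxp_dt(1.2 * xp_ss, S, K1, K2, XT)  # negative: X_P decreases
```

A positive rate below the steady state and a negative rate above it mean that $X_P$ is pushed back toward the steady state from both sides, so the steady state is stable.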

Our assumption of linear kinetic rate laws implies that $X_T$ is much less than the
Michaelis constants of both the kinase and the phosphatase. If this is not the case, 
then we should use Michaelis-Menten rate laws. 











Figure 6.6 Sigmoidal signal-response curve (see text). 


In this case (figure 6.6a), the governing ODE is

$$\frac{dX}{dt} = -\frac{k_1 E_K X}{K_{m1} + X} + \frac{k_2 E_P (X_T - X)}{K_{m2} + X_T - X}, \tag{6.10}$$

where $X_T - X = X_P$, $E_K$ and $E_P$ are the total concentrations of kinase and
phosphatase (taken to be constant in this equation), and $K_{m1}$ and $K_{m2}$ are the
Michaelis constants. At steady state, we have

$$\frac{k_1 E_K X}{K_{m1} + X} = \frac{k_2 E_P (X_T - X)}{K_{m2} + X_T - X} \tag{6.11}$$

or, after simplifying and scaling the relevant variables,

$$u_1 x (J_2 + 1 - x) = u_2 (1 - x)(J_1 + x) \tag{6.12}$$

where $x = X/X_T$, $u_1 = k_1 E_K$, $u_2 = k_2 E_P$, $J_1 = K_{m1}/X_T$, $J_2 = K_{m2}/X_T$. Using
the quadratic formula, we can solve this equation for $x$ as a function of $u_1$, $u_2$, $J_1$,
and $J_2$. In terms of the phosphorylated fraction, we get $X_P/X_T = 1 - x = G(u_1, u_2, J_1, J_2)$,
where the Goldbeter-Koshland function, $G$, is defined as

$$G(u_1, u_2, J_1, J_2) = \frac{2 u_1 J_2}{B + \sqrt{B^2 - 4 (u_2 - u_1) u_1 J_2}} \tag{6.13}$$

where $B = u_2 - u_1 + u_2 J_1 + u_1 J_2$ (see Goldbeter and Koshland (1981)). In terms
of the original variables, the steady state value of $X_P$ is a sigmoidal function of the
input signal $E_K$ (see figure 6.6b), and so we call this a sigmoidal signal-response curve.
The sigmoid becomes more and more switch-like as $J_1$ and $J_2$ become much less than 1.

To confirm the sigmoidal character of the Goldbeter-Koshland function, it is
easier to think of $u_1$ as a function of $x$ than $x$ as a function of $u_1$. Rearranging
equation 6.12, we find that $u_1 = u_2 \frac{(1 - x)(J_1 + x)}{x (J_2 + 1 - x)}$. As a function of $x$, this curve
crosses the x-axis at $x = 1$ and $x = -J_1$ and has vertical asymptotes at $x = 0$
and $x = 1 + J_2$. For $0 < J_1, J_2 \ll 1$, the curve must have the shape illustrated in
figure 6.6b.
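The Goldbeter-Koshland function is short enough to code directly, and doing so gives a quick check on the algebra. In the sketch below the function name and test values are illustrative. The residual of equation 6.12 is evaluated at $x = 1 - G$, because with equation 6.12 written for the unphosphorylated fraction $x$, the formula of equation 6.13 returns the phosphorylated fraction.

```python
import math


def goldbeter_koshland(u1, u2, j1, j2):
    """Equation 6.13: steady state phosphorylated fraction X_P / X_T."""
    b = u2 - u1 + u2 * j1 + u1 * j2
    return 2.0 * u1 * j2 / (b + math.sqrt(b * b - 4.0 * (u2 - u1) * u1 * j2))


def residual_6_12(x, u1, u2, j1, j2):
    """Left side minus right side of equation 6.12 for x = X / X_T."""
    return u1 * x * (j2 + 1.0 - x) - u2 * (1.0 - x) * (j1 + x)


# Check that x = 1 - G satisfies equation 6.12 (hypothetical values).
g = goldbeter_koshland(2.0, 1.0, 0.1, 0.1)
res = residual_6_12(1.0 - g, 2.0, 1.0, 0.1, 0.1)

# Switch-like response when J1 and J2 are much less than 1:
low = goldbeter_koshland(0.5, 1.0, 0.01, 0.01)    # kinase weaker than phosphatase
high = goldbeter_koshland(2.0, 1.0, 0.01, 0.01)   # kinase stronger than phosphatase
```

With $J_1 = J_2 = 0.01$, the phosphorylated fraction jumps from about 1 percent to about 99 percent as $u_1$ goes from half to twice $u_2$, which is the switch-like behavior described above.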

We can prove the stability of the steady state by the same graphical methods 
used for the case of linear reaction kinetics, but we omit the details. 











6.4 Networks with Feedback 
6.4.1 What Is Feedback? 


Biochemical reaction networks commonly contain feedback loops, for which the 
output of one reaction affects the progress of an upstream reaction. Feedback can 
be characterized as positive or negative, depending on the net effect of the inter- 
actions. When reaction networks have intertwined feedback loops, their dynamical 
properties can be exceedingly complex (see chapter 1 and chapter 2). 

We start our investigation of feedback loops with two-component interactions 
(figure 6.7), which can be categorized as negative feedback (6.7a and b), positive 
feedback (6.7c), or mutual antagonism (6.7d). Mathematically speaking, the effect of
species $X_j$ on the rate of change of another species $X_i$, where $dX_i/dt = F_i(X_1, \ldots, X_n)$, is the
partial derivative $\partial F_i / \partial X_j$. The sign of this derivative determines whether the feedback
is positive or negative. Naturally, this partial derivative need not be constant and
may change sign depending on the state and on parameter values, so classifying the
effect is not always unambiguous. A chain of such effects makes a feedback loop if it
starts and ends with the same species.

Figure 6.7 Three types of feedback are possible between two components: a) and b) are
negative feedback, c) positive feedback, d) mutual antagonism.


6.4.2 Negative Feedback 


We start with a simple example of negative feedback (figure 6.8). The phosphory- 
lated form of Y activates the degradation of X, and X is the kinase that phospho- 
rylates Y. In this case, we need at least two differential equations to characterize 
the system: 








\[
\frac{dX}{dt} = k_1 S - k_2 Y_P X \tag{6.14}
\]
\[
\frac{dY_P}{dt} = \frac{k_3 X (Y_T - Y_P)}{K_{m3} + Y_T - Y_P} - \frac{k_4 E Y_P}{K_{m4} + Y_P} \tag{6.15}
\]

where $Y = Y_T - Y_P$ is the concentration of the unphosphorylated form of Y,
$[X] = [Y_P] = [S] = [E] = \text{nM}$, $[k_1] = [k_3] = [k_4] = \text{min}^{-1}$, $[k_2] = \text{nM}^{-1}\,\text{min}^{-1}$,
and $[K_{m3}] = [K_{m4}] = \text{nM}$. The equation for $X$ is constant synthesis (proportional
to $S$) minus degradation (proportional to $Y_P \cdot X$). The equation for $Y_P$ is just
the case studied in subsection 6.3.2. We know how each of these differential
equations behaves in isolation, but what happens when they are coupled together?
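
One way to explore the coupled behavior is to integrate the two equations numerically. The sketch below uses a hand-written classical Runge-Kutta (RK4) step so that it needs only the Python standard library. The parameter values are assumptions read from the figure 6.8 caption ($k_1 = k_2 = k_3 = k_4 = 1$, $S = K_{m3} = K_{m4} = 0.1$, $Y_T = 1$, $E = 0.5$); the helper names are illustrative and not part of the original model description.

```python
# Assumed parameter values (see the figure 6.8 caption).
PARAMS = dict(k1=1.0, k2=1.0, k3=1.0, k4=1.0, S=0.1,
              Km3=0.1, Km4=0.1, YT=1.0, E=0.5)


def rhs(state, p):
    """Right-hand sides of equations 6.14 and 6.15."""
    X, Yp = state
    dX = p["k1"] * p["S"] - p["k2"] * Yp * X
    dYp = (p["k3"] * X * (p["YT"] - Yp) / (p["Km3"] + p["YT"] - Yp)
           - p["k4"] * p["E"] * Yp / (p["Km4"] + Yp))
    return (dX, dYp)


def rk4_step(f, state, dt, p):
    """One classical fourth-order Runge-Kutta step."""
    a = f(state, p)
    b = f(tuple(s + 0.5 * dt * k for s, k in zip(state, a)), p)
    c = f(tuple(s + 0.5 * dt * k for s, k in zip(state, b)), p)
    d = f(tuple(s + dt * k for s, k in zip(state, c)), p)
    return tuple(s + dt * (ka + 2 * kb + 2 * kc + kd) / 6.0
                 for s, ka, kb, kc, kd in zip(state, a, b, c, d))


def integrate(f, state, p, t_end, dt=0.01):
    """Advance the state from t = 0 to t = t_end with fixed step dt."""
    for _ in range(int(round(t_end / dt))):
        state = rk4_step(f, state, dt, p)
    return state


if __name__ == "__main__":
    for start in ((0.1, 0.9), (2.0, 0.0)):
        X, Yp = integrate(rhs, start, PARAMS, 100.0)
        print(f"start {start} -> X = {X:.4f}, Yp = {Yp:.4f}")
```

Under these assumed values, different initial conditions end at the same point, which is the behavior the phase-plane analysis in the next subsection explains.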








6.4.3 Phase Planes, Vector Fields, and Nullclines 


As described earlier, at any point in time $t_0$, the network must reside in a particular
state, $(X(t_0), Y(t_0))$, which is just a point in the two-dimensional state space of the
system of ODEs. For the case of a two-species network, the state space is called
the phase plane. At each point in the phase plane, the differential equations define
a vector that tells us which direction and how far the dynamical system will move
over the next small increment of time, $\Delta t$. We can think of the phase plane as
covered with little vectors, like the hair on the head of a new military recruit. This
collection of vectors is called the vector field. A solution to the ODEs is just a curve
that starts at some initial point and follows the vector field.

The vector field in the phase plane is conveniently characterized by the X- and
Y-nullclines, the curves along which the corresponding species' time derivative is exactly
zero. Along the X-nullcline, the vector field points north (N) or south (S) because
$dX/dt = 0$ (that is, no change in the east-west direction). Along the Y-nullcline,
the vector field points east (E) or west (W) because $dY/dt = 0$ (no change in the
north-south direction). In the regions between the nullclines, the vector field adopts
one of four characteristic compass directions (NE, SE, SW, or NW). Wherever the
nullclines intersect, the pair of ODEs has a steady state solution (both $dX/dt = 0$ and
$dY/dt = 0$).

Figure 6.8 Example of negative feedback. a) Wiring diagram. b) Phase portrait.
Dashed curves: nullclines given by equations 6.16 and 6.17; solid curves: trajectories of
equations 6.14 and 6.15. Parameter values are $k_1 = k_2 = k_3 = k_4 = 1$, $S = K_{m3} =
K_{m4} = 0.1$, $Y_T = 1$, and $E = 0.5$. In b), trajectories spiral into a stable steady state at
the intersection of the nullclines.

In the above example for negative feedback, the nullclines are:

\[
X\text{-nullcline:} \quad k_1 S = k_2 Y_P X \;\Rightarrow\; Y_P = \frac{k_1 S}{k_2 X} \tag{6.16}
\]
\[
Y_P\text{-nullcline:} \quad \frac{k_3 X (Y_T - Y_P)}{K_{m3} + Y_T - Y_P} = \frac{k_4 E Y_P}{K_{m4} + Y_P}
\;\Rightarrow\; Y_P = Y_T \, G\!\left(k_3 X, k_4 E, \frac{K_{m3}}{Y_T}, \frac{K_{m4}}{Y_T}\right) \tag{6.17}
\]

where $G$ is the Goldbeter-Koshland function defined by equation 6.13. These curves
are easily plotted on the phase plane (figure 6.8b) along with representative
trajectories that show how the system evolves with time from several different
initial conditions. The X-nullcline is a hyperbola, while the $Y_P$-nullcline is a
sigmoidal curve with the switch point at $X = \frac{k_4 E}{k_3} \cdot \frac{Y_T + 2 K_{m3}}{Y_T + 2 K_{m4}}$. Of particular note is
how all trajectories seem to be sucked into the steady state. When this is the case,
we call the steady state locally and globally stable. It is possible to be locally stable
but not globally stable or to be locally unstable, as we shall soon see.
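
The nullcline construction can be reproduced numerically. The sketch below locates the intersection of the two nullclines by bisection and reports the compass direction of the vector field at a given point. It assumes the parameter values quoted in the figure 6.8 caption and the nullcline forms $Y_P = k_1 S/(k_2 X)$ and $Y_P = Y_T\,G(k_3 X, k_4 E, K_{m3}/Y_T, K_{m4}/Y_T)$; the function names are illustrative.

```python
import math

# Assumed parameter values (see the figure 6.8 caption).
P = dict(k1=1.0, k2=1.0, k3=1.0, k4=1.0, S=0.1,
         Km3=0.1, Km4=0.1, YT=1.0, E=0.5)


def G(u1, u2, J1, J2):
    """Goldbeter-Koshland function (equation 6.13)."""
    B = u2 - u1 + u2 * J1 + u1 * J2
    return 2.0 * u1 * J2 / (B + math.sqrt(B * B - 4.0 * (u2 - u1) * u1 * J2))


def rhs(state, p):
    X, Yp = state
    dX = p["k1"] * p["S"] - p["k2"] * Yp * X
    dYp = (p["k3"] * X * (p["YT"] - Yp) / (p["Km3"] + p["YT"] - Yp)
           - p["k4"] * p["E"] * Yp / (p["Km4"] + Yp))
    return (dX, dYp)


def x_nullcline(X, p):
    """Yp on the curve dX/dt = 0 (a hyperbola)."""
    return p["k1"] * p["S"] / (p["k2"] * X)


def yp_nullcline(X, p):
    """Yp on the curve dYp/dt = 0 (a sigmoid in X)."""
    return p["YT"] * G(p["k3"] * X, p["k4"] * p["E"],
                       p["Km3"] / p["YT"], p["Km4"] / p["YT"])


def steady_state(p, lo=1e-6, hi=100.0, tol=1e-13):
    """Bisection on the gap between the nullclines (negative at lo, positive at hi)."""
    while hi - lo > tol:
        mid = 0.5 * (lo + hi)
        if yp_nullcline(mid, p) - x_nullcline(mid, p) < 0.0:
            lo = mid
        else:
            hi = mid
    X = 0.5 * (lo + hi)
    return X, yp_nullcline(X, p)


def compass(state, p):
    """Direction of the vector field: N/S from dYp/dt, E/W from dX/dt."""
    dX, dYp = rhs(state, p)
    return ("N" if dYp > 0 else "S") + ("E" if dX > 0 else "W")


if __name__ == "__main__":
    print("steady state:", steady_state(P))
    print("direction at (0.1, 0.1):", compass((0.1, 0.1), P))
```

Bisection is sufficient here because the sigmoid rises with $X$ while the hyperbola falls, so the gap between them changes sign exactly once.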








6.4.4 Positive Feedback 


Figure 6.9 presents a simple example of positive feedback, where species X activates 
species Y (via phosphorylation) and the phosphorylated form of Y promotes the 
synthesis of X. 

Figure 6.9 Example of positive feedback. Wiring diagram.


One possible set of equations to describe this network is 








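\[
\frac{dX}{dt} = k_1 S + k_2 Y_P - k_3 X \tag{6.18}
\]
\[
\frac{dY_P}{dt} = \frac{k_4 X (Y_T - Y_P)}{K_{m4} + Y_T - Y_P} - \frac{k_5 E Y_P}{K_{m5} + Y_P} \tag{6.19}
\]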














where $[X] = [Y_P] = [S] = [E] = \text{nM}$, $[k_1] = [k_2] = [k_3] = [k_4] = [k_5] = \text{min}^{-1}$,
and $[K_{m4}] = [K_{m5}] = \text{nM}$. For this system of equations, the X-nullcline is $Y_P =
(k_3/k_2) X - (k_1/k_2) S$ and the $Y_P$-nullcline is $Y_P = Y_T \, G(k_4 X, k_5 E, K_{m4}/Y_T, K_{m5}/Y_T)$
(plotted in figure 6.10a). Notice that as we increase or decrease $S$, the X-nullcline
moves down or up, and there is a range of $S$ values, $S \in (S_{c1}, S_{c2})$, for which
the nullclines intersect in three places. The points at the ends of this range, where
the system changes from one to three steady states, are called saddle-node (SN)
bifurcation points. For $S_{c1} < S < S_{c2}$, we say that the system is bistable.
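
The change from one to three intersections can be checked by counting sign changes of the gap between the two nullclines on a fine grid of $X$ values. The sketch below assumes the nullcline forms $Y_P = (k_3 X - k_1 S)/k_2$ and $Y_P = Y_T\,G(k_4 X, k_5 E, K_{m4}/Y_T, K_{m5}/Y_T)$ and the parameter values suggested by the figure 6.10 caption; $k_5 = 1$ is not listed there and is a guess. Function names are illustrative.

```python
import math

# Assumed parameter values (figure 6.10 caption; k5 = 1 is a guess).
P = dict(k1=1.0, k2=0.8, k3=1.2, k4=1.0, k5=1.0,
         Km4=0.05, Km5=0.05, YT=1.0, E=0.5)


def G(u1, u2, J1, J2):
    """Goldbeter-Koshland function (equation 6.13)."""
    B = u2 - u1 + u2 * J1 + u1 * J2
    return 2.0 * u1 * J2 / (B + math.sqrt(B * B - 4.0 * (u2 - u1) * u1 * J2))


def nullcline_gap(X, S, p):
    """Yp-nullcline minus X-nullcline at a given X."""
    line = (p["k3"] * X - p["k1"] * S) / p["k2"]
    curve = p["YT"] * G(p["k4"] * X, p["k5"] * p["E"],
                        p["Km4"] / p["YT"], p["Km5"] / p["YT"])
    return curve - line


def count_steady_states(S, p, n=4000, x_max=2.0):
    """Count sign changes of the gap on a midpoint grid over (0, x_max)."""
    dx = x_max / n
    signs = [nullcline_gap((i + 0.5) * dx, S, p) > 0.0 for i in range(n)]
    return sum(1 for a, b in zip(signs, signs[1:]) if a != b)


if __name__ == "__main__":
    for S in (0.02, 0.2, 0.6):
        print(f"S = {S:4.2f}: {count_steady_states(S, P)} steady state(s)")
```

With these assumed values, a low or high signal leaves a single steady state, while an intermediate signal gives three, which is the bistable window between the two saddle-node points.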

Figure 6.10 Example of positive feedback. a) Phase portrait. b) One-parameter
bifurcation diagram. Solid curves: stable steady states; dashed curve in between: unstable
steady states. For $S_{c1} < S < S_{c2}$, the control system is bistable. Parameter values are
$k_1 = k_4 = 1$, $k_2 = 0.8$, $k_3 = 1.2$, $S = 0.2$, $K_{m4} = K_{m5} = 0.05$, $Y_T = 1$, and $E = 0.5$. In a),
trajectories move away from the unstable steady state (in the center) to one of two stable
steady states.

A good way to visualize this bifurcation behavior is to plot a one-parameter
bifurcation diagram with $S$ on the abscissa and either the $X$ or $Y_P$ concentration
for each steady state on the ordinate, as in figure 6.10b. In general, this is a compact
way to visualize how the dynamics of a system depend on its parameters. In this
particular case, the system exhibits hysteresis as the parameter $S$ passes back and
forth through the region of bistability. That is, for low $S$, the system is at rest in the
lower steady state (which is globally attracting). As $S$ increases, the control system
remains at this lower steady state, even after passing into the region of bistability,
because the lower steady state is stable with respect to small perturbations. Finally,
as $S$ increases past the upper bifurcation point ($S_{c2}$), the system abruptly shifts
to the upper stable steady state. Now, if $S$ were to decrease, the control system
would remain in the upper steady state until $S$ falls below the lower critical value,
$S_{c1}$. Only then will the system switch back to the lower steady state. This
non-reversibility is called hysteresis.
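
The hysteresis loop can be traced numerically by stepping $S$ up and then down, letting the system relax at each value and carrying the final state over to the next step. The sketch below does this for equations of the form $dX/dt = k_1 S + k_2 Y_P - k_3 X$ together with equation 6.19, using parameter values assumed from the figure 6.10 caption ($k_5 = 1$ is a guess). The step size, relaxation time, and function names are illustrative choices.

```python
# Assumed parameter values (figure 6.10 caption; k5 = 1 is a guess).
P = dict(k1=1.0, k2=0.8, k3=1.2, k4=1.0, k5=1.0,
         Km4=0.05, Km5=0.05, YT=1.0, E=0.5)


def rhs(state, S, p):
    X, Yp = state
    dX = p["k1"] * S + p["k2"] * Yp - p["k3"] * X
    dYp = (p["k4"] * X * (p["YT"] - Yp) / (p["Km4"] + p["YT"] - Yp)
           - p["k5"] * p["E"] * Yp / (p["Km5"] + Yp))
    return (dX, dYp)


def relax(state, S, p, t_end=80.0, dt=0.02):
    """Integrate with classical RK4 long enough to settle near a steady state."""
    for _ in range(int(round(t_end / dt))):
        a = rhs(state, S, p)
        b = rhs(tuple(s + 0.5 * dt * k for s, k in zip(state, a)), S, p)
        c = rhs(tuple(s + 0.5 * dt * k for s, k in zip(state, b)), S, p)
        d = rhs(tuple(s + dt * k for s, k in zip(state, c)), S, p)
        state = tuple(s + dt * (ka + 2 * kb + 2 * kc + kd) / 6.0
                      for s, ka, kb, kc, kd in zip(state, a, b, c, d))
    return state


def sweep(S_values, state, p):
    """Relax at each S in turn, starting each step from the previous end state."""
    xs = []
    for S in S_values:
        state = relax(state, S, p)
        xs.append(state[0])
    return xs, state


S_GRID = [0.05 * i for i in range(13)]            # S = 0.00, 0.05, ..., 0.60
UP, top_state = sweep(S_GRID, (0.0, 0.0), P)      # increasing S
DOWN = sweep(S_GRID[::-1], top_state, P)[0][::-1]  # decreasing S, re-ordered

if __name__ == "__main__":
    for S, up, down in zip(S_GRID, UP, DOWN):
        print(f"S = {S:4.2f}   X (S rising) = {up:.3f}   X (S falling) = {down:.3f}")
```

Inside the bistable window the two columns differ, because the state reached depends on the direction from which $S$ was approached; outside it they coincide.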


6.4.5 Mutual Antagonism 


Mutual antagonism is a situation where an increase in either species means a 
decrease in the other, as in figure 6.11. Here, X phosphorylates Y, so more X 
implies less Y. Further, Y degrades X, so more Y means less X. The equations for 
this module are: 








\[
\frac{dX}{dt} = \ldots \tag{6.20}
\]
\[
\frac{dY}{dt} = \frac{k_3 E (Y_T - Y)}{K_{m3} + Y_T - Y} - \frac{k_4 X Y}{K_{m4} + Y} \tag{6.21}
\]


where $Y_T = Y + Y_P$ is constant, and the dimensions of the variables and rate
constants are as before. In this case, the X-nullcline is now a hyperbola Y = 
tsak, which is similar to the negative feedback case. The Y-nullcline is Y = 
$Y_T \, G(k_3 E, k_4 X, K_{m3}/Y_T, K_{m4}/Y_T)$, which is a switch function that turns off as $X$
increases. As in the case of positive feedback, there may be multiple intersections 
of the nullclines and a region of bistability for the parameter S (see figure 6.12a). 


Figure 6.12b shows a one-parameter bifurcation diagram for this system. 


Figure 6.11 Example of mutual antagonism. Wiring diagram. 

Figure 6.12 Example of mutual antagonism. a) Phase portrait. b) One-parameter
bifurcation diagram. Parameter values are $k_1 = k_2 = k_3 = k_4 = 1$, ko 0.1, $S = 0.125$,
$K_{m3} = K_{m4} = 0.05$, $Y_T = 1$, and $E = 0.25$. In a), trajectories move away from the
unstable steady state (in the center) to one of two stable steady states.








Recently there have appeared a number of interesting experimental studies of 
bistability in macromolecular regulatory networks: in the MAP kinase signaling 
pathway of frog eggs (Ferrell Jr. and Machleder, 1998; Xiong and Ferrell Jr., 2003), 
in the activation of MPF in frog egg extracts (Sha et al., 2003; Pomerening et al., 
2003), in the lactose utilization network of bacteria (Ozbudak et al., 2004), and in 
artificial genetic networks (Gardner et al., 2000). 





6.5 Networks That Oscillate 


There are three simple motifs that generate oscillatory behavior: activator-inhibitor, 
substrate-depletion, and delayed negative feedback. 


6.5.1 Activator-Inhibitor 


In figure 6.13, R stimulates its own production by phosphorylating E, and $E_P$ also
stimulates the production of X. (Think of $E_P$ as the active form of a transcription
factor.) As X increases, it promotes degradation of R. This negative feedback loop
between X and R can cause oscillation (figure 6.14a). The equations for this system
are

\[
\frac{dR}{dt} = k_0 E_P + k_1 S - k_2 X R \tag{6.22}
\]
\[
\frac{dX}{dt} = k_3 E_P - k_4 X \tag{6.23}
\]

where $E_P = E_T \, G(k_5 R, k_6, K_{m5}/E_T, K_{m6}/E_T)$. The X-nullcline is $X = (k_3/k_4) E_P$ and the
R-nullcline is $X = (k_0 E_P + k_1 S)/(k_2 R)$. The phase portrait (figure 6.14a) clearly shows
the tendency of the vector field to drive trajectories in a circulatory pattern. For
appropriate values of the parameters, the control system exhibits a closed trajectory,
called a stable limit cycle. As the system rotates around the limit cycle, $R(t)$
and $X(t)$ oscillate periodically in time. The classic example of activator-inhibitor
oscillations in cell biology is the cyclic AMP signaling system of the cellular slime
mold, Dictyostelium discoideum (Martiel and Goldbeter, 1987); see box 6.5.

Figure 6.14 An activator-inhibitor oscillator. a) Phase portrait. b) One-parameter
bifurcation diagram. Parameter values are $k_0 = k_1 = k_2 = k_3 = k_5 = k_6 = 1$, $k_4 = 0.5$,
$S = 0.5$, $K_{m5} = K_{m6} = 0.1$, and $E_T = 1$. In a), trajectories spiral in towards a limit cycle
surrounding the unique unstable steady state. In b), the min and max values of $R$ on the
limit cycle oscillation are plotted in the region between the two Hopf bifurcations, $S_{H1}$
and $S_{H2}$.








As we increase or decrease the signal strength, $S$, the R-nullcline shifts up or
down, and though there is always only one steady state (one intersection of the
nullclines), the stability of the steady state changes as we change $S$. For $S_{H1} < S <
S_{H2}$, the steady state is unstable and surrounded by a limit cycle. The boundary
points, $S_{H1}$ and $S_{H2}$, are called Hopf bifurcation points. Figure 6.14b plots the
one-parameter bifurcation diagram for this system, along with the amplitude of the
oscillatory solution where it exists.

Box 6.5: Cyclic AMP oscillations in Dictyostelium

Cyclic AMP binds to a membrane receptor, which activates adenylate cyclase, the 
enzyme that catalyzes the synthesis of cyclic AMP from ATP (this is the positive 
feedback loop promoting the autocatalytic production of cyclic AMP). Meanwhile, 
cyclic AMP binding to the receptor promotes phosphorylation and desensitization of 
the receptor (this is the negative feedback loop, the desensitized receptor being the 
“inhibitor” that shuts off autocatalytic production of cyclic AMP). Next, cyclic AMP 
is hydrolyzed to 5’-AMP, which allows the receptor to slowly regain its sensitivity. 
Only then can there be a new burst of cyclic AMP synthesis. 





6.5.2 Substrate-Depletion 


In the substrate-depletion motif (figure 6.15), substrate X is converted by enzyme 
E into product R in a process which is autocatalytically amplified by R-dependent 
phosphorylation of E. This positive feedback loop leads to an explosive production 
of R which depletes the pool of the substrate, X. Naturally, once X is depleted, the 
production of R ceases and the degradation of R reduces its concentration below 
the level necessary to sustain the positive feedback loop. At this point, the pool 
of X begins to replenish. When X builds up sufficiently high, the positive feedback 
loop reengages, and a new burst of R synthesis commences. 





Figure 6.15 A substrate-depletion oscillator. Wiring diagram.

The differential equations for the model in figure 6.15 are


\[
\frac{dX}{dt} = k_1 S - (k_0' + k_0 E_P) X \tag{6.24}
\]
\[
\frac{dR}{dt} = (k_0' + k_0 E_P) X - k_2 R \tag{6.25}
\]

where $E_P = E_T \, G(k_3 R, k_4, K_{m3}/E_T, K_{m4}/E_T)$. The X-nullcline is $X = k_1 S/(k_0' + k_0 E_P)$ and the
R-nullcline is $X = k_2 R/(k_0' + k_0 E_P)$. Again, the phase portrait (figure 6.16a) shows a
circulatory pattern around the steady state, and for a suitable choice of parameters,
the system executes a stable limit cycle oscillation. In this case, the X-nullcline
shifts upward (downward) as $S$ increases (decreases). As before, the one-parameter
bifurcation diagram shows two Hopf bifurcations and oscillatory solutions in
between (figure 6.16b). Substrate-depletion oscillations are common in biochemical
networks (see table 6.1).
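
A short numerical experiment illustrates why this motif oscillates. The sketch below computes the steady state of equations 6.24 and 6.25, estimates the Jacobian there by central differences, and then integrates the system to measure the late-time swing in $R$. The parameter values are assumptions read from the figure 6.16 caption ($k_0 = k_1 = k_2 = k_3 = k_4 = 1$, $k_0' = 0.1$, $S = 1$, $K_{m3} = K_{m4} = 0.1$, $E_T = 1$), and `k0p` stands for $k_0'$; all function names are illustrative.

```python
import math

# Assumed parameter values (see the figure 6.16 caption); k0p denotes k0'.
P = dict(k0=1.0, k0p=0.1, k1=1.0, k2=1.0, k3=1.0, k4=1.0,
         S=1.0, Km3=0.1, Km4=0.1, ET=1.0)


def G(u1, u2, J1, J2):
    """Goldbeter-Koshland function (equation 6.13)."""
    B = u2 - u1 + u2 * J1 + u1 * J2
    return 2.0 * u1 * J2 / (B + math.sqrt(B * B - 4.0 * (u2 - u1) * u1 * J2))


def rhs(state, p):
    """Right-hand sides of equations 6.24 and 6.25."""
    X, R = state
    Ep = p["ET"] * G(p["k3"] * R, p["k4"], p["Km3"] / p["ET"], p["Km4"] / p["ET"])
    flux = (p["k0p"] + p["k0"] * Ep) * X
    return (p["k1"] * p["S"] - flux, flux - p["k2"] * R)


def steady_state(p):
    R = p["k1"] * p["S"] / p["k2"]      # adding 6.24 and 6.25 gives k1*S = k2*R
    Ep = p["ET"] * G(p["k3"] * R, p["k4"], p["Km3"] / p["ET"], p["Km4"] / p["ET"])
    return (p["k2"] * R / (p["k0p"] + p["k0"] * Ep), R)


def jacobian_trace_det(state, p, h=1e-6):
    """Trace and determinant of the Jacobian via central differences."""
    cols = []                            # cols[i][j] = d f_j / d state_i
    for i in range(2):
        up, dn = list(state), list(state)
        up[i] += h
        dn[i] -= h
        fu, fd = rhs(tuple(up), p), rhs(tuple(dn), p)
        cols.append([(a - b) / (2.0 * h) for a, b in zip(fu, fd)])
    trace = cols[0][0] + cols[1][1]
    det = cols[0][0] * cols[1][1] - cols[1][0] * cols[0][1]
    return trace, det


def r_amplitude(p, start=(1.0, 0.5), t_end=200.0, dt=0.01):
    """Max minus min of R over the second half of an RK4 simulation."""
    state, n, seen = start, int(round(t_end / dt)), []
    for step in range(n):
        a = rhs(state, p)
        b = rhs(tuple(s + 0.5 * dt * k for s, k in zip(state, a)), p)
        c = rhs(tuple(s + 0.5 * dt * k for s, k in zip(state, b)), p)
        d = rhs(tuple(s + dt * k for s, k in zip(state, c)), p)
        state = tuple(s + dt * (ka + 2 * kb + 2 * kc + kd) / 6.0
                      for s, ka, kb, kc, kd in zip(state, a, b, c, d))
        if step >= n // 2:
            seen.append(state[1])
    return max(seen) - min(seen)


if __name__ == "__main__":
    ss = steady_state(P)
    print("steady state:", ss, " trace, det:", jacobian_trace_det(ss, P))
    print("late-time swing in R:", r_amplitude(P))
```

A positive trace together with a positive determinant marks the steady state as unstable, so under these assumed values trajectories cannot settle there and instead keep cycling.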

Figure 6.16 A substrate-depletion oscillator. a) Phase portrait. b) One-parameter
bifurcation diagram. Parameter values are $k_0 = k_1 = k_2 = k_3 = k_4 = 1$, $k_0' = 0.1$,
$S = 1$, $K_{m3} = K_{m4} = 0.1$, and $E_T = 1$. In a), trajectories spiral in towards a limit cycle
surrounding the unique unstable steady state. In b), the min and max values of $X$ on the
limit cycle oscillation are plotted in the region between the two Hopf bifurcations.








Table 6.1 Examples of substrate-depletion oscillators.

Example      Substrate            Activator               Reference
Frog egg     Cyclin B             MPF                     (Tyson, 1991)
Glycolysis   F6P+ATP              FDP+ADP                 (Selkov, 1968)
Calcium      Ca$^{2+}$ in ER*     Ca$^{2+}$ in cytosol    (Dupont et al., 1991)
Ecosystem    Prey                 Predator                (Maynard-Smith, 1974)

* ER = endoplasmic reticulum.


6.5.3 Delayed Negative Feedback 


In figure 6.17, we present an example of delayed negative feedback. In this scheme, 
R phosphorylates E, which then binds to C to form X, and X is the active complex 
that degrades R itself (closing the negative feedback loop). This motif is derived 
from components of the cell cycle regulatory mechanism in eukaryotes, where R is 
MPF (mitosis promoting factor), E is APC (anaphase promoting complex), C is 
Cdc20, and X is a complex of APC and Cdc20. 

The corresponding set of equations is 








\[
\frac{dR}{dt} = k_1 S - k_2 X R \tag{6.26}
\]
\[
\frac{dE_P}{dt} = \frac{k_3 R (E_T - E_P)}{K_{mk} + E_T - E_P} - \frac{k_4 Q E_P}{K_{mp} + E_P}
- k_5 \left[ E_P (C_T - X) - K_d X \right] \tag{6.27}
\]
\[
\frac{dX}{dt} = k_5 \left[ E_P (C_T - X) - K_d X \right] \tag{6.28}
\]

where $E_T = E + E_P$ is the total concentration of the APC, $Q$ is the (fixed)
concentration of a phosphatase, and $C_T = C + X$ is the total concentration of
Cdc20. Having left the familiar territory of two-variable systems and phase plane
portraits, we must now rely on numerical and qualitative results.

Figure 6.17 A negative feedback oscillator. Wiring diagram.


Table 6.2 Parameter values for the delayed negative feedback oscillator.

Parameter   Description             Value   Units
$k_1$       1st-order rate const    1       min$^{-1}$
$k_2$       2nd-order rate const    1       nM$^{-1}$ min$^{-1}$
$k_3$       1st-order rate const    1       min$^{-1}$
$k_4$       1st-order rate const    1       min$^{-1}$
$k_5$       2nd-order rate const    0.01    nM$^{-1}$ min$^{-1}$
$K_{mk}$    Michaelis constant      1       nM
$K_{mp}$    Michaelis constant      1       nM
$K_d$       Equilibrium constant    50      nM
$S$         Signal                  0.3     nM
$Q$         Phosphatase concen.     100     nM
$E_T$       Total APC concen.       100     nM
$C_T$       Total Cdc20 concen.     1       nM








Using the parameter values in table 6.2 and $S$ as the control parameter, we can
compute a one-parameter bifurcation diagram (figure 6.18a) using numerical tools. 
In this case, there are two critical values of S at which the system undergoes Hopf 
bifurcations, with oscillatory solutions in between, 0.2 < S < 0.4 (roughly). A 
typical oscillation for S in this range is plotted in figure 6.18b. 
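
For readers who want to experiment, the three-variable system can be coded directly. The sketch below writes out the right-hand sides of equations 6.26 to 6.28 with the table 6.2 values and advances them with a fixed-step RK4 integrator. The initial condition, step size, and function names are assumptions made for illustration, and whether sustained oscillations appear depends on $S$ and on running the simulation long enough; the text reports them for roughly $0.2 < S < 0.4$.

```python
import math

# Parameter values from table 6.2 (Kd is the equilibrium constant).
P = dict(k1=1.0, k2=1.0, k3=1.0, k4=1.0, k5=0.01, Kmk=1.0, Kmp=1.0,
         Kd=50.0, S=0.3, Q=100.0, ET=100.0, CT=1.0)


def rhs(state, p):
    """Right-hand sides of equations 6.26, 6.27 and 6.28."""
    R, Ep, X = state
    binding = p["k5"] * (Ep * (p["CT"] - X) - p["Kd"] * X)
    dR = p["k1"] * p["S"] - p["k2"] * X * R
    dEp = (p["k3"] * R * (p["ET"] - Ep) / (p["Kmk"] + p["ET"] - Ep)
           - p["k4"] * p["Q"] * Ep / (p["Kmp"] + Ep)
           - binding)
    return (dR, dEp, binding)


def simulate(p, state, t_end, dt=0.005, every=100):
    """Fixed-step RK4; returns the state sampled every `every` steps."""
    samples = [state]
    for step in range(1, int(round(t_end / dt)) + 1):
        a = rhs(state, p)
        b = rhs(tuple(s + 0.5 * dt * k for s, k in zip(state, a)), p)
        c = rhs(tuple(s + 0.5 * dt * k for s, k in zip(state, b)), p)
        d = rhs(tuple(s + dt * k for s, k in zip(state, c)), p)
        state = tuple(s + dt * (ka + 2 * kb + 2 * kc + kd) / 6.0
                      for s, ka, kb, kc, kd in zip(state, a, b, c, d))
        if step % every == 0:
            samples.append(state)
    return samples


if __name__ == "__main__":
    # Hypothetical initial condition: some R, no phosphorylated E, no complex.
    for R, Ep, X in simulate(P, (1.0, 0.0, 0.0), t_end=20.0)[::8]:
        print(f"R = {R:8.4f}   Ep = {Ep:8.4f}   X = {X:8.4f}")
```

Extending `t_end` and repeating the run for several values of `S` gives a rough, simulation-based view of the behavior summarized in the bifurcation diagram.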

Small amplitude oscillations due to a “pure” negative feedback loop have recently 
been observed by Pomerening et al. (2005) in frog egg extracts (see box 6.6). A 
long negative feedback loop on PER-protein synthesis seems to play a major role 
in circadian rhythms, as described in chapter 2. 

Figure 6.18 A negative feedback oscillator. a) One-parameter bifurcation diagram.
b) Simulation for $S = 0.3$. See table 6.2 for parameter values. Definitions: $u = R/Q$,
$v = E_P/E_T$, $w = X/C_T$. In a), the min and max values of $v$ on the limit cycle oscillation
are plotted in the region between the two Hopf bifurcations.





Box 6.6: Negative feedback oscillations in frog egg extracts 

Frog egg extracts are convenient preparations in which to observe the negative feed- 
back loop involving MPF and APC, although the native regulatory system also 
includes a substrate-depletion oscillator involving phosphorylation of MPF (see ta- 
ble 6.1). By clever experimental techniques, Pomerening et al. (2005) have knocked 
out the substrate-depletion oscillator in a frog egg extract, revealing the negative feed- 
back oscillator in its (presumably) unadulterated state. They observed “pure” negative 
feedback oscillations in their preparations. In the absence of the self-amplification of 
MPF activity provided by the substrate-depletion motif, the pure negative feedback 
oscillations are of considerably smaller amplitude and drive ambiguous transitions 
into and out of mitosis. It seems that the positive feedback mechanism is important 
to amplify the negative feedback oscillations and give unambiguous signals to nuclei 
to enter and leave mitosis. 














6.6 A Multiple-Feedback Network: p53 and Mdm2 


Transcriptional activator p53 is involved in cell cycle arrest and apoptosis (pro- 
grammed cell death). In normal cells, the level of p53 is kept low by Mdm2, which 
promotes degradation of p53. The transcription of Mdm2 is activated by p53,
creating a negative feedback loop (p53 → Mdm2 -| p53). When a cell is subjected
to environmental stress causing DNA damage or oncogene activation, the activity 
of Mdm2 is weakened, allowing accumulation of p53 in the nucleus. Recently, it 
has been observed (Lahav et al., 2004) that p53 and Mdm2 undergo one or more 
oscillations in response to ionizing radiation (which causes double-stranded breaks 
of DNA), in an apparent attempt to repair the damage. Ciliberto et al. (2005) 
have proposed a simple mechanism (figure 6.1), including both negative and posi- 
tive feedback, which quantitatively reproduces this behavior. The equations for the 
network in figure 6.1 are 
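
\[
\frac{d[\mathrm{p53}]}{dt} = k_{s53} - k'_{d53}[\mathrm{p53}] - k_f[\mathrm{Mdm2_{nuc}}][\mathrm{p53}] + k_r[\mathrm{p53U}] \tag{6.29}
\]
\[
\frac{d[\mathrm{p53U}]}{dt} = k_f[\mathrm{Mdm2_{nuc}}][\mathrm{p53}] - k_r[\mathrm{p53U}] - k'_{d53}[\mathrm{p53U}]
- k_f[\mathrm{Mdm2_{nuc}}][\mathrm{p53U}] + k_r[\mathrm{p53UU}] \tag{6.30}
\]
\[
\frac{d[\mathrm{p53UU}]}{dt} = k_f[\mathrm{Mdm2_{nuc}}][\mathrm{p53U}] - k_r[\mathrm{p53UU}]
- (k'_{d53} + k_{d53})[\mathrm{p53UU}] \tag{6.31}
\]
\[
\frac{d[\mathrm{Mdm2_{cyt}}]}{dt} = k'_{s2} + \frac{k_{s2}[\mathrm{p53_{tot}}]^m}{J_s^m + [\mathrm{p53_{tot}}]^m} - k'_{d2}[\mathrm{Mdm2_{cyt}}]
- \frac{k_{ph}}{J_{ph} + [\mathrm{p53_{tot}}]}[\mathrm{Mdm2_{cyt}}] + k_{deph}[\mathrm{Mdm2P_{cyt}}] \tag{6.32}
\]
\[
\frac{d[\mathrm{Mdm2P_{cyt}}]}{dt} = \frac{k_{ph}}{J_{ph} + [\mathrm{p53_{tot}}]}[\mathrm{Mdm2_{cyt}}] - k'_{d2}[\mathrm{Mdm2P_{cyt}}]
- k_{deph}[\mathrm{Mdm2P_{cyt}}] - k_i[\mathrm{Mdm2P_{cyt}}] + k_o[\mathrm{Mdm2_{nuc}}] \tag{6.33}
\]
\[
\frac{d[\mathrm{Mdm2_{nuc}}]}{dt} = V_{ratio}\left(k_i[\mathrm{Mdm2P_{cyt}}] - k_o[\mathrm{Mdm2_{nuc}}]\right)
- k_{d2}[\mathrm{Mdm2_{nuc}}] \tag{6.34}
\]
\[
\frac{d[\mathrm{DNA_{dam}}]}{dt} = k_{dam}[\mathrm{IR}] - k_{rep}[\mathrm{p53_{tot}}]\frac{[\mathrm{DNA_{dam}}]}{J_{rep} + [\mathrm{DNA_{dam}}]} \tag{6.35}
\]

where

\[
k_{d2} = k'_{d2} + \frac{[\mathrm{DNA_{dam}}]}{J_{dam} + [\mathrm{DNA_{dam}}]} k''_{d2} \tag{6.36}
\]
\[
[\mathrm{p53_{tot}}] = [\mathrm{p53}] + [\mathrm{p53U}] + [\mathrm{p53UU}] \tag{6.37}
\]
\[
[\mathrm{Mdm2_{tot}}] = [\mathrm{Mdm2_{cyt}}] + [\mathrm{Mdm2P_{cyt}}] + [\mathrm{Mdm2_{nuc}}] \tag{6.38}
\]
\[
V_{ratio} = \frac{V_{cytoplasm}}{V_{nucleus}} \tag{6.39}
\]
\[
[\mathrm{IR}] = \text{imposed dose of ionizing radiation} \tag{6.40}
\]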


The network contains a long negative feedback loop (p53 → Mdm2$_{cyt}$ →
Mdm2P$_{cyt}$ → Mdm2$_{nuc}$ -| p53) and a long positive feedback loop (p53 → PTEN
-| PIP3 → Akt → Mdm2P$_{cyt}$ → Mdm2$_{nuc}$ -| p53). The positive feedback loop
is shortened to p53 -| Mdm2P$_{cyt}$ → Mdm2$_{nuc}$ -| p53.

A simulation of this network (figure 6.19) compares very favorably with the
experimental observations of Lahav et al. (2004). As the radiation dose increases
(figure 6.19d), the number of pulses of p53 increases. The reason for this curious
"digital" response of p53 to DNA damage is made clear by the one-parameter
bifurcation diagram in figure 6.20, where we plot system response, $[\mathrm{p53_{tot}}]$, as
a function of the extent of DNA damage, measured by $k_{d2}$. The positive feedback

Table 6.3 Parameter values for p53-Mdm2 network

Rate Constants (min$^{-1}$)
$k_{s53} = 0.055$    $k'_{d53} = 0.0055$   $k_{d53} = 8$
$k_f = 8.8$          $k_r = 2.5$           $k'_{s2} = 0.0015$
$k_{s2} = 0.006$     $k'_{d2} = 0.01$      $k''_{d2} = 0.01$
$k_{ph} = 0.05$      $k_{deph} = 6$        $k_i = 14$
$k_o = 0.5$          $k_{dam} = 0.18$      $k_{rep} = 0.017$






Other Constants (dimensionless) 
Jaa Jpn = 0.01 Iresi 
Jdam = 0.2 Vratio =15 m=3 

Figure 6.19 Simulation of gamma-irradiation experiment (reproduced from Ciliberto
et al. (2005), with permission). At the beginning of the simulation, the system is at steady
state. (A) Between time 10 and 20, the control system is exposed to a transient damaging
agent, which induces two large amplitude oscillations in p53$_{tot}$ and Mdm2$_{nuc}$. (B) The
oscillations of the two cytoplasmic forms of Mdm2 have a smaller amplitude compared to
the Mdm2$_{nuc}$ concentration in panel (A). (C) The oscillations are initiated as a consequence
of the increase in $k_{d2}$, which is induced by irradiation. As the damage is repaired, $k_{d2}$ decreases
back to its basal value. (D) The number of pulses increases with the amount of damage. In
the simulation, we count the number of oscillations as a function of the irradiation time.
In panels A through C, irradiation time = 10 min.

in the network creates multiple steady states (the S-shaped curve at $k_{d2} \lesssim 0.01$),
but the negative feedback loop makes the upper steady state unstable. The unstable
upper steady state is surrounded by stable limit cycle oscillations of $[\mathrm{p53_{tot}}]$ and
$[\mathrm{Mdm2_{tot}}]$. The region of stable oscillation is bounded above (at $k_{d2} = 0.853$) by
a Hopf bifurcation, and below (at $k_{d2} = 0.0135$) by a saddle-node-loop bifurcation.
For a broad range of values of $k_{d2}$, that is, of DNA damage, the system responds
with pulses of p53 and Mdm2 of fixed amplitude and period, exactly as observed by
Lahav et al. (2004). As the damage is repaired, $k_{d2}$ drops toward $k'_{d2} = 0.01$, and
the oscillations disappear abruptly as $k_{d2}$ crosses the saddle-node-loop bifurcation
point.





Figure 6.20 Bifurcation diagram (reproduced from Ciliberto et al. (2005), with
permission). Recurrent states (steady states and limit cycles) for p53$_{tot}$ are plotted as functions
of $k_{d2}$, the degradation rate of Mdm2$_{nuc}$. The solid line represents stable steady states,
the dotted line unstable steady states. Black dots are the maxima and minima of the
stable limit cycles. The grey solid line represents p53$_{tot}$ as a function of $k_{d2}$ from the
simulation shown in figure 6.19. Notice that in figure 6.19 $k_{d2}$ is a variable (see equations),
while here it is a parameter (all other equations and parameter values as in table 6.3).
When the qualitative behavior of the system changes, it is said to undergo a bifurcation.
In the p53/Mdm2 model there is a saddle-node (SN) bifurcation at $k_{d2} = 0.0018$ and a
saddle-node-loop (SNL) bifurcation at $k_{d2} = 0.0135$. Before the SNL bifurcation there is
only one stable steady state, with low p53 ("p53 OFF"); after the SNL the steady state
becomes unstable, surrounded by a stable limit cycle. The family of stable limit cycles
disappears at a Hopf bifurcation at $k_{d2} = 0.8532$ (not shown on the diagram).


A somewhat different model of p53/Mdm2 oscillations in response to ionizing
radiation has recently been published by Ma et al. (2005).

6.7 Conclusions


How are cell biologists to make reliable connections between molecular interaction 
networks and cell behaviors, when intuition fails in all but the simplest cases? In 
this chapter, we propose to make the connection by translating the reaction network 
into a set of nonlinear differential equations that describe how all the interacting 
species are changing with time (figure 6.21). 


The Dynamical Perspective 

Figure 6.21 A dynamical perspective on molecular cell biology. To make a connection
between molecular mechanisms and cell physiology, we convert the mechanism into a set
of kinetic equations, by standard principles of biochemical kinetics, and view the kinetic
equations as defining a vector field in the state space of the dynamical variables. The
vector field has attracting solutions (steady states and oscillations) that correspond to
characteristic physiological responses of the cell. The dependence of these attractors on
kinetic constants (hence, on genetics and environment) is robustly captured in bifurcation
diagrams.


The differential equations define a vector field in the state space of the network. 
The vector field points to certain stable attractors, which can be correlated with 
long-term, stable behavior of the network and of the cell it governs. Transitions 
from one stable attractor to another represent the responses of the cell to specific 
perturbations (signals). A natural way to describe the signal-response properties of 
a regulatory network is in terms of a one-parameter bifurcation diagram, which effi- 
ciently displays the stable attractors (steady states and oscillators) and transitions 
between attractors as signal strength (the “parameter”) varies. 

We have illustrated these ideas with simple examples of linear, hyperbolic, and
sigmoidal signal-response curves, of bistable switches based on positive feedback
or mutual inhibition, and of limit cycle oscillators based on substrate depletion,
activator-inhibitor interactions, or time-delayed negative feedback. These
fundamental motifs (switches and oscillations) can be coupled together into networks of
increasing complexity and dynamical potential. Interested readers should now be
ready to read and understand the growing body of literature that takes this
dynamical perspective on interesting topics in cell physiology. Some nice examples include:
cell-cycle control (Tyson et al., 2002), circadian rhythms (Leloup and Goldbeter,
1998), lysogenic viruses (Arkin et al., 1998), quorum sensing in bacteria (James
et al., 2000; Usseglio Viretta and Fussenegger, 2004), NF-κB signaling (Hoffmann
et al., 2002), and programmed cell death (Eissing et al., 2004).





7 Qualitative Approaches to the Analysis of
Genetic Regulatory Networks


Hidde de Jong and Delphine Ropers 


There is a growing demand for methods that can make predictions of qualitative 
properties of the dynamics of molecular interaction networks, that is, properties that 
are invariant for a range of reaction mechanisms and values of kinetic constants. 
On the one hand, precise and quantitative information on reaction mechanisms 
and kinetic constants is not available for most networks of biological interest. 
On the other hand, in many situations predictions of qualitative rather than 
quantitative dynamical properties are appropriate for gaining an understanding 
of the functioning of a molecular interaction network. This chapter discusses 
three examples of qualitative approaches for the analysis of genetic regulatory 
networks, allowing qualitative dynamical properties to be inferred from currently- 
available incomplete and non-quantitative data. The approaches are based on 
different formalisms, namely discrete abstractions of differential equations, Boolean 
networks, and graphs. We illustrate the approaches by means of a simple two-gene 
network and give an example of their application to real biological systems. 





7.1 Motivation for Qualitative Approaches 


Differential equations are the classical formalism for modeling the behavior of
natural and man-made systems. Not surprisingly, they form the most prominent
approach for the modeling, analysis, and simulation of molecular interaction
networks (chapter 6). The application of differential equations rests on a well- 
established theoretical framework for the deterministic modeling of the kinetics of 
biochemical reaction systems (Cornish-Bowden, 1995; Heinrich and Schuster, 1996). 
In addition, a variety of mathematical methods and computer tools is available for 
transforming the model equations into experimentally-testable predictions. Many 
excellent examples exist to demonstrate the capability of differential equations to 
help gain insight into the functioning of molecular interaction networks of
biological importance, such as the control of circadian rhythms in Drosophila (Leloup and
Goldbeter, 2003), the metabolism of the red-blood cell in humans (Joshi and Pals- 
son, 1989), the regulation of the cell cycle in yeast and higher eukaryotes (Tyson 
et al., 2001), and the signaling pathway involved in the maturation of oocytes in 
Xenopus laevis (Ferrell Jr. and Machleder, 1998). 

In principle, the use of differential equations allows precise numerical predictions 
of the behavior of molecular interaction networks to be made. However, for many 
networks of biological interest, such predictions are difficult or even impossible 
to obtain. In the first place, the biochemical reaction mechanisms underlying 
the interactions are usually unknown or only partly known, which complicates the
formulation of the differential equation models. In the second place, quantitative 
data on kinetic constants and molecular concentrations are generally absent, even 
for extensively studied textbook systems. 

In addition to these practical difficulties, one can raise the question of whether 
quantitative information on reaction mechanisms and kinetic constants is essential 
for understanding the functioning of molecular interaction networks. In fact, it is 
reasonable to assume that many important dynamical properties of living systems 
do not depend on precise numerical values or a specific reaction mechanism (Barkai 
and Leibler (1997); Eldar et al. (2002); Rao et al. (2004); see also chapter 2). 
In other words, in many situations qualitative dynamical properties—dynamical 
properties that are invariant for a range of reaction mechanisms and values of 
kinetic constants—are more important than quantitative dynamical properties. The 
qualitative properties express the intimate connection between the behavior of the 
system and the structure of the network of molecular interactions, independently 
from the quantitative details of the latter. 

For all of the above reasons, there is a growing interest in qualitative approaches 
for the modeling, analysis, and simulation of molecular interaction networks, ca- 
pable of inferring qualitative properties of the system dynamics from currently- 
available incomplete and non-quantitative data. The aim of this chapter is to re- 
view existing qualitative approaches focusing on one particular type of molecular 
interaction network, genetic regulatory networks. These networks mainly involve 
interactions between proteins and nucleic acids, controlling the transcription and 
translation of genes. In the next sections, we first explain the notion of qualitative 
dynamical property and then discuss three representative examples of qualitative 
approaches. These approaches are based on formalisms increasingly remote from the 
differential equation models traditionally used: discrete abstractions of differential 
equations, Boolean networks, and graphs. 

Of course, our review of qualitative approaches does not pretend to be exhaustive. 
Some of the more obvious omissions of this chapter are Petri nets (Koch et al., 
2005; Reddy et al., 1996), constraint-based models (Covert et al., 2004; Edwards 
and Palsson, 2000; Edwards et al., 2001a; Stelling et al., 2002), and process algebras 
(Eker et al., 2002; Regev et al., 2001). On the one hand, these model formalisms 
partially overlap with the formalisms discussed here, or are reviewed at length in 
other chapters of the book (chapter 5). On the other hand, they seem to have been
mostly used for metabolic and signal transduction networks, rather than for genetic 
regulatory networks. For further reference, the reader may consult other reviews 
of qualitative approaches in systems biology (de Jong, 2002; Gagneur and Casari, 
2005). Another restriction of the scope of this chapter is that we focus on methods 
that derive behavior predictions from structural information on the network, thus 
leaving out of consideration methods that aim at inferring the network structure 
from observations on the behavior of the system. Such reverse engineering methods 
have been developed for Boolean networks and their relatives (Akutsu et al., 2000; 
Ideker et al., 2000; Laubenbacher and Stigler, 2004; Liang et al., 1998; Perkins et al., 
2004), while Bayesian methods for inferring graph models from gene expression data 
are discussed in chapter 4. 





7.2 Qualitative Properties of the Dynamics of Genetic Regulatory Networks 


In order to develop the notion of qualitative dynamical property, we will consider 
a simple network of two genes (figure 7.1). Each of the genes encodes a regulatory 
protein that inhibits the expression of the other gene, by binding to a site over- 
lapping the promoter of the gene. Simple as it is, this cross-inhibition network is 
a basic component of more complex, real networks and makes it possible to ana- 
lyze some characteristic aspects of cellular differentiation (Monod and Jacob, 1961; 
Thomas and d’Ari, 1990). Moreover, its dynamical properties have been experimen- 
tally tested by Gardner et al., who have reconstructed the network in Escherichia 
coli cells by cloning the genes on a plasmid. The genes on the plasmid have been 
chosen such that the network functions independently from the rest of the cell 
and the activity of the corresponding proteins can be regulated by external signals 
(Gardner et al. (2000); see also chapter 13). 
Figure 7.1 Example of a simple genetic regulatory network, composed of two genes a
and b, the proteins A and B, and their regulatory interactions. 


The cross-inhibition network can be modeled by means of differential equations. 
Generally speaking, a genetic regulatory network of n genes is conveniently de- 
scribed by a system of n ordinary differential equations: 


$$\frac{dx_i}{dt} = f_i(x), \quad i \in \{1, \ldots, n\}, \tag{7.1}$$
where $x = (x_1, \ldots, x_n)' \in \Omega$ is a vector of cellular protein
concentrations, and $\Omega \subset \mathbb{R}^n_{\geq 0}$. The function
$f_i : \Omega \to \mathbb{R}$ expresses how the rate of change of the
concentration of the protein encoded by gene $i$ depends on the concentrations
$x$ of the proteins in the cell. A differential equation model of the
cross-inhibition network is shown in figure 7.2a. The variables $x_a$ and $x_b$
represent the concentrations of the proteins A and B. The time derivative of
$x_a$ equals the difference between the rate of synthesis of A, given by
$\kappa_a h^-(x_b, \theta_b, m_b)$, and the rate of degradation, given by
$\gamma_a x_a$. The use of the sigmoidal Hill function $h^-$, shown in
figure 7.2b, means that for low concentrations of the protein B, gene a is
expressed at a rate close to its maximum rate $\kappa_a$, whereas for high
concentrations of B, the expression of the gene is almost completely repressed.
The shape of the Hill function is in agreement with experimental evidence
(Ptashne, 1992). The rate of degradation of A is proportional to the
concentration of the protein. The differential equation for $x_b$ has an
analogous interpretation.

$$\frac{dx_a}{dt} = \kappa_a \, h^-(x_b, \theta_b, m_b) - \gamma_a x_a$$

$$\frac{dx_b}{dt} = \kappa_b \, h^-(x_a, \theta_a, m_a) - \gamma_b x_b$$

$$h^-(x, \theta, m) = \frac{\theta^m}{x^m + \theta^m}$$


Figure 7.2 (a) Nonlinear ordinary differential equation model of the
cross-inhibition network (figure 7.1). The non-negative variables $x_a$ and
$x_b$ correspond to the concentrations of proteins A and B, respectively, the
parameters $\kappa_a$ and $\kappa_b$ to the synthesis rates of the proteins,
the parameters $\gamma_a$ and $\gamma_b$ to degradation constants, the
parameters $\theta_a$ and $\theta_b$ to the threshold concentrations, and the
parameters $m_a$ and $m_b$ to the degree of cooperativity of the interactions.
All parameters are positive. (b) Graphical representation of the
characteristic sigmoidal form, for $m > 1$, of the Hill function
$h^-(x, \theta, m)$.


The use of the nonlinear Hill function does not make it possible to analytically 
solve the system of differential equations. However, the dynamics of the two-gene 
network can be analyzed in the phase plane, by means of standard techniques 
developed in dynamical systems theory (Kaplan and Glass (1995); Strogatz (2000); 
see also chapter 6). The phase portrait in figure 7.3a shows that the system is 
bistable, in the sense that it possesses two asymptotically stable equilibrium points, 
at which either protein A or protein B is present at a high concentration. The third 
equilibrium point, characterized by intermediate concentrations for proteins A and 
B, is unstable and cannot be experimentally observed. The phase-plane analysis 
also reveals that the system exhibits hysteresis. If one strongly perturbs the system 
from one of its stable equilibria—for instance, by provoking the degradation of 
the protein present at a high concentration—the other equilibrium can be reached 
(figure 7.3b). From then onwards, even if the source of strong degradation has 
disappeared, the system will remain at the new equilibrium. In other words, the
analysis suggests that a simple molecular mechanism may allow the system to switch 
from one functional state to another. Interestingly, this has been confirmed by the 
experiments of Gardner et al. mentioned above. 
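
The bistability and hysteresis described above can be checked numerically. The following Python sketch integrates the model of figure 7.2 with the explicit Euler method. The parameter values, the step size, and the way the perturbation is modeled (a temporary increase of the degradation rate of A) are illustrative assumptions, not values taken from the experiments of Gardner et al.

```python
# Illustrative parameter values (assumptions, not taken from the text).
KAPPA_A = KAPPA_B = 1.0   # synthesis rates
GAMMA_A = GAMMA_B = 1.0   # degradation constants
THETA_A = THETA_B = 0.5   # threshold concentrations
M_A = M_B = 4             # degrees of cooperativity


def hill_minus(x, theta, m):
    """Decreasing Hill function h^-(x, theta, m) = theta^m / (x^m + theta^m)."""
    return theta**m / (x**m + theta**m)


def simulate(xa, xb, t_end, dt=0.01, extra_degradation_a=0.0, t_perturb=0.0):
    """Integrate the cross-inhibition model with the explicit Euler method.

    For t < t_perturb, protein A is degraded at the increased rate
    GAMMA_A + extra_degradation_a, which mimics a transient perturbation.
    """
    for step in range(int(t_end / dt)):
        t = step * dt
        gamma_a = GAMMA_A + (extra_degradation_a if t < t_perturb else 0.0)
        dxa = KAPPA_A * hill_minus(xb, THETA_B, M_B) - gamma_a * xa
        dxb = KAPPA_B * hill_minus(xa, THETA_A, M_A) - GAMMA_B * xb
        xa, xb = xa + dt * dxa, xb + dt * dxb
    return xa, xb


# Two initial conditions on either side of the unstable equilibrium.
high_a = simulate(0.8, 0.2, t_end=50.0)
high_b = simulate(0.2, 0.8, t_end=50.0)

# Hysteresis: start at the "A high" equilibrium, degrade A strongly for a
# while, then remove the perturbation and let the system relax.
switched = simulate(*high_a, t_end=50.0, extra_degradation_a=19.0, t_perturb=10.0)
```

Under these assumed values, the first two runs settle near different equilibria, and the perturbed run ends near the equilibrium at which B is high, where it remains after the extra degradation has been removed.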


Figure 7.3 (a) Phase portrait of the differential equation model of the
cross-inhibition network (figure 7.2). The system has two asymptotically stable
equilibrium points (se) and one unstable equilibrium point (ue). The equilibria
lie at the intersection of the nullclines of $x_a$ and $x_b$ (drawn curves
annotated by $dx_a/dt = 0$ and $dx_b/dt = 0$). (b) Hysteresis effect, resulting
from a transient perturbation of the system (dashed line with arrow).


The above-mentioned dynamical properties of the cross-inhibition network,
bistability and hysteresis, are invariant for a range of parameter values and
molecular mechanisms. That is, they are qualitative properties of the system.
For instance, a moderate increase of the value of $\theta_b$ causes the
nullcline of $x_a$ to move upwards (figure 7.4a). This deforms the phase
portrait, but does not lead to the loss of the bistability and hysteresis
properties. For large changes in parameter values, though, the qualitative
properties may not be invariant. Figure 7.4b shows what happens for values of
$\theta_b$ close to, or above, $\kappa_b/\gamma_b$. In this case one of the
stable equilibria and the unstable equilibrium approach and annihilate each
other, so that the system is no longer bistable and no longer exhibits
hysteresis. In the terminology of dynamical systems theory, a bifurcation has
occurred (Kaplan and Glass (1995); Strogatz (2000); see also chapter 6).
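
The loss of bistability for large threshold values can be illustrated by counting equilibria numerically. The sketch below assumes a Hill function of the form $\theta^m / (x^m + \theta^m)$ and illustrative parameter values (equal synthesis and degradation constants, so that the maximum concentration of each protein is 1). It counts the sign changes, on a grid, of the function whose zeros are the intersections of the two nullclines. The function name `count_equilibria` and the grid resolution are choices made for this example.

```python
def hill_minus(x, theta, m):
    """Decreasing Hill function h^-(x, theta, m) = theta^m / (x^m + theta^m)."""
    return theta**m / (x**m + theta**m)


def count_equilibria(theta_b, theta_a=0.5, kappa=1.0, gamma=1.0, m=4, n=4000):
    """Count equilibria as sign changes of F(x_a) on a regular grid.

    At an equilibrium both nullclines intersect:
        x_b = (kappa/gamma) * h^-(x_a, theta_a, m)
        x_a = (kappa/gamma) * h^-(x_b, theta_b, m)
    so F(x_a) = (kappa/gamma) * h^-(x_b(x_a), theta_b, m) - x_a vanishes.
    """
    x_max = 2.0 * kappa / gamma          # concentrations cannot exceed kappa/gamma
    signs = []
    for k in range(n + 1):
        xa = x_max * k / n
        xb = kappa / gamma * hill_minus(xa, theta_a, m)
        f = kappa / gamma * hill_minus(xb, theta_b, m) - xa
        if f != 0.0:                     # skip grid points that hit a zero exactly
            signs.append(f > 0.0)
    return sum(1 for s, t in zip(signs, signs[1:]) if s != t)


for theta_b in (0.5, 0.6, 1.2):
    print(theta_b, count_equilibria(theta_b))
```

With these assumed values, a moderate change of the threshold (0.5 to 0.6) leaves three equilibria, whereas a threshold above the maximum concentration of B (1.2) leaves only one. A grid-based count of this kind is only a rough check and can miss equilibria that lie closer together than the grid spacing.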

Figure 7.4 Changes in the phase portrait of the differential equation model of
the cross-inhibition network (figure 7.2), following a change in the value of
the parameter $\theta_b$. The change in (a) preserves the bistability and
hysteresis properties, whereas the more important change in (b) does not.

The invariance of the dynamical properties of genetic regulatory networks for
changes in the reaction mechanisms and the value of kinetic constants can be
defined more generally and more rigorously than has been done here. A classical
treatment is found in the book by Andronov et al., who define qualitative
dynamical properties as those properties invariant under trajectory-preserving
topological mappings of (a region of) the phase space (Andronov et al., 1973;
Kalagnanam et al., 1991). What will interest us in this chapter are not the
technicalities of the definition, but rather the practical question of how
qualitative properties of the dynamics of genetic regulatory networks can be
inferred from weak information on the structure of the network and the type of
the interactions.

The example of the two-gene network shows that dynamical systems theory 
provides concepts and techniques for the characterization of qualitative properties 
of dynamical systems, notably the construction and analysis of phase portraits. 
Unfortunately, the theoretical results become much weaker when studying higher- 
order systems. While higher-order systems can sometimes be reduced to second- 
order systems, by time-scale abstraction or other model simplifications, this is not 
always possible. More fundamentally, the insights to be gained from dynamical 
systems theory are to a large extent based on geometrical representations that are 
difficult to manipulate in higher dimensions. 

The qualitative methods discussed in the remainder of this chapter also try to 
infer qualitative properties of the dynamics of genetic regulatory networks. However, 
they employ model formalisms and representations of the system dynamics that 
are more abstract than differential equations and phase portraits. Although the 
predictions that can be made by means of these qualitative methods are less precise, 
they are based on theoretical results and computational techniques that better scale 
up to large and complex systems. In the next sections, we will discuss three examples 
of qualitative methods, based on model formalisms that make increasingly stronger 
abstractions of the process of gene regulation. 





7.3 Discrete Abstractions of the Dynamics of Differential Equations 


In response to the problem that the use of phase portraits does not scale up well to 
higher dimensions, alternative representations of the qualitative dynamics could be 
proposed. A closer look at figure 7.3a, the phase portrait of the two-gene network, 
suggests one such alternative. In every region of the phase plane bounded by the
nullclines of $x_a$ and $x_b$, the system behaves in a qualitatively homogeneous way,
in the sense that the derivatives of the concentration variables have the same sign 
everywhere. When solution trajectories leave one region and enter another, the sign 
of one or both derivatives changes. The partition of the phase space into regions 
suggests a discrete abstraction of the qualitative dynamics of the system, which will 
be formally developed in this section. 

Consider again the system of differential equations 7.1, describing a network
of $n$ genes. The nullclines of the system are the hypersurfaces on which
$f_i(x) = 0$. They define a partition $\mathcal{R}$ of the phase space
$\Omega$, consisting of regions $R$ in which the time derivative of each of the
concentration variables $x_i$ has a unique sign. We introduce a function
$\tau : \mathcal{R} \to \{-, 0, +\}^n$, associating a derivative sign pattern
to each region $R \in \mathcal{R}$. Figure 7.5a shows the partition of the
phase space obtained in the case of the model of the two-gene network discussed
in the previous section, while figure 7.5c shows the derivative sign pattern
for each region. Suppose that $R, R' \in \mathcal{R}$ are two contiguous
regions in the phase space. If there is a solution of equation 7.1 that on a
time interval $T$ reaches $R'$ from $R$, without leaving $R \cup R'$, then we
say that there exists a transition from $R$ to $R'$, denoted $R \to R'$.
Formally, $\to \, \subseteq \mathcal{R} \times \mathcal{R}$. For instance, by
looking at the direction of the vector field in figure 7.3a, it can easily be
inferred which transitions between contiguous regions are possible.
Self-transitions, from a region to itself, occur when solutions do not
instantaneously cross a region, but remain in it for some time.
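
The derivative sign pattern of a region can be obtained by evaluating the right-hand side of the model at any point inside it. The sketch below does this for the cross-inhibition model, again assuming a Hill function of the form $\theta^m / (x^m + \theta^m)$ and illustrative parameter values. The function names and the tolerance used to decide that a derivative is zero are choices made for this example.

```python
# Illustrative parameter values (assumptions, not taken from the text).
KAPPA, GAMMA, THETA, M = 1.0, 1.0, 0.5, 4


def hill_minus(x, theta, m):
    """Decreasing Hill function h^-(x, theta, m) = theta^m / (x^m + theta^m)."""
    return theta**m / (x**m + theta**m)


def sign(value, tol=1e-12):
    """Map a real number to one of the symbols '-', '0', '+'."""
    if value > tol:
        return "+"
    if value < -tol:
        return "-"
    return "0"


def sign_pattern(xa, xb):
    """Derivative sign pattern (sign of dx_a/dt, sign of dx_b/dt) at a point."""
    dxa = KAPPA * hill_minus(xb, THETA, M) - GAMMA * xa
    dxb = KAPPA * hill_minus(xa, THETA, M) - GAMMA * xb
    return (sign(dxa), sign(dxb))


# Sample the phase plane on a coarse grid and collect the patterns that occur.
grid = [k / 10 for k in range(13)]
patterns = {sign_pattern(xa, xb) for xa in grid for xb in grid}
```

Sampling points is only a way of exploring the partition. It does not by itself identify the regions or the transitions between them, and regions lying on the nullclines are hit only by chance.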

The above definitions underlie an abstract, discrete representation of the dynam- 
ics of the continuous differential equation system in the form of a transition graph: 


$$TG = (\mathcal{R}, \to). \tag{7.2}$$


The vertices of the graph correspond to the regions of the phase space and the edges 
to the transitions between regions. Each of the regions can be seen as a discrete 
or qualitative state of the system, in which the derivatives of the concentration 
variables have a particular sign pattern. A sequence of regions $\phi$ is a path in
the transition graph if and only if $\phi = (R^0)$ or $\phi = (R^0, \ldots, R^m)$, $m > 0$, and
for all $i \in [0, \ldots, m - 1]$, we have $R^i \to R^{i+1}$. A path in the transition graph
gives a qualitative description of the behavior of the system, in the sense that it 
describes how the derivative sign pattern changes over time. The transition graph 
corresponding to the cross-inhibition network is shown in figure 7.5b. 

Figure 7.5 (a) Partition of the phase space for the differential equation model
of the cross-inhibition network (figure 7.2), using the nullclines
$dx_a/dt = 0$ and $dx_b/dt = 0$. (b) Transition graph consisting of domains and
transitions between domains. The small dots next to domains indicate
self-transitions. (c) Sign of $dx_a/dt$ and $dx_b/dt$ in the regions of the
phase space, as defined by the function $\tau$.

This discrete representation of the dynamics of a continuous differential
equation system facilitates the analysis of the behavior of genetic regulatory
networks. In fact, equilibrium points correspond to regions
$R \in \mathcal{R}$ such that $\tau(R) = (0, \ldots, 0)'$, while the stability
of the equilibrium points can be inferred from the outgoing transitions of the
contiguous regions. As expected, the transition graph in figure 7.5b contains
three regions coinciding with equilibrium points. For two of these regions, the
paths starting in the contiguous regions lead towards the equilibrium points,
thus suggesting that the latter are stable. This is not the case for the third
region, which corresponds to an unstable equilibrium point. The hysteresis
property can also be inferred from the transition graph, bearing in mind that
for a region $R'$ to be reachable from another region $R$, there must be a path
running from $R$ to $R'$. It is then immediately seen from figure 7.5b that a
perturbation away from the region of one stable equilibrium point may cause the
system to attain the other stable equilibrium point. More generally, the
transition graph can be shown to be a conservative approximation (Alur et al.,
2000; Chutinan and Krogh, 2001) of the differential equation system, in the
sense that every solution of the latter corresponds to some path in the former
(although the converse does not necessarily hold). This means that the
transition graph can be safely used to study the qualitative dynamics of the
differential equation system.

The above reformulation of the study of qualitative properties of differential 
equation systems raises two important questions. First, for which range of parameter 
values is the transition graph invariant? Second, how can we actually compute the 
transition graph in the absence of precise numerical information on the parameters? 
These problems have been addressed in several areas of computer science and control 
theory, in particular in the context of work on qualitative simulation in artificial 
intelligence (de Jong, 2005; Kuipers, 1994) and on discrete abstractions in hybrid 
systems theory (Alur et al., 2000; Chutinan and Krogh, 2001). Examples of the
application of these approaches to the analysis of genetic regulatory networks are 
the hybrid automaton models of Ghosh and Tomlin (2004) and the qualitative 
differential equation models of Heidtke and Schulze-Kremer (1998). Below we will 
discuss in more detail the qualitative simulation method developed in our group 
(Batt et al., 2005a; de Jong et al., 2004b), which has been specifically designed so 
as to favor scaling up to large and complex networks. 

The qualitative simulation method uses simplified models of gene regulation 
proposed by Glass and Kauffman in the early seventies (Edwards et al., 2001b; 
Glass and Kauffman, 1973; Gouzé and Sari, 2002; Mestl et al., 1995). The major 
difference between these so-called piecewise-linear differential equations and the 
differential equations used to model the cross-inhibition network is that the sigmoid
function $h^-$ in figure 7.2b is replaced by a step function $s^-$ that abruptly changes
from 1 to 0 at a threshold value $\theta$:


$$s^-(x, \theta) = \begin{cases} 1, & \text{if } x < \theta, \\ 0, & \text{if } x > \theta. \end{cases} \tag{7.3}$$


By means of the threshold values of the variables, the phase space can be par- 
titioned into hyperrectangular regions, in each of which the system behaves in a 
qualitatively-homogeneous manner.¹ It has been proven that the transition graph
defined on this partition is invariant for certain inequality constraints on the pa- 
rameters that can often be inferred from the experimental literature. Moreover, it 
is possible to compute the transition graph, by means of simple symbolic rules, 
from a piecewise-linear differential equation model of the network supplemented by 
inequality constraints. The qualitative simulation method has been implemented in 
the computer tool Genetic Network Analyzer (GNA) (de Jong et al., 2003a). 
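
To give an impression of how symbolic rules can replace numerical parameter values, the following Python sketch derives transitions between the hyperrectangular regions of a piecewise-linear model of the cross-inhibition network. It is a deliberately coarse illustration and not the algorithm implemented in GNA: it only considers the four regions lying strictly below or above the two thresholds, it ignores what happens on the threshold planes themselves, and it assumes the inequality constraints $0 < \theta_a < \kappa_a/\gamma_a$ and $0 < \theta_b < \kappa_b/\gamma_b$. All function names are hypothetical.

```python
from itertools import product


def step_minus(is_above_threshold):
    """Step function s^-(x, theta): 1 below the threshold, 0 above it."""
    return 0 if is_above_threshold else 1


def focal_state(state):
    """Region towards which the system moves from `state`.

    A state is a pair (a_high, b_high) telling whether each concentration lies
    above its threshold. Under the assumed constraints theta < kappa/gamma, an
    expressed gene drives its protein above the threshold (towards
    kappa/gamma), while a repressed gene lets it decay towards 0.
    """
    a_high, b_high = state
    target_a_high = step_minus(b_high) == 1   # gene a is expressed iff B is low
    target_b_high = step_minus(a_high) == 1   # gene b is expressed iff A is low
    return (target_a_high, target_b_high)


def successors(state):
    """Neighbouring regions reachable by crossing a single threshold."""
    target = focal_state(state)
    result = set()
    for i in range(2):
        if state[i] != target[i]:
            neighbour = list(state)
            neighbour[i] = target[i]
            result.add(tuple(neighbour))
    return result


# Transition graph over the four regions, computed without numerical values.
transition_graph = {s: successors(s) for s in product((False, True), repeat=2)}
```

In this sketch, the two regions in which one protein is high and the other low have no successors, which is consistent with the two stable equilibria of the differential equation model. Only the ordering of thresholds and ratios $\kappa/\gamma$ is used, not their numerical values.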

The method and the computer tool have been applied to the analysis of the 
complex genetic regulatory network controlling the initiation of sporulation in 
the Gram-positive soil bacterium Bacillus subtilis (de Jong et al., 2004a). Under 
conditions of nutrient deprivation, B. subtilis cells may stop dividing and instead
form a dormant, environmentally resistant spore. The decision to abandon growth
and division and initiate sporulation involves a radical change in the pattern of gene 
expression in the cell. The switch of the genetic program is controlled by a complex 
regulatory network integrating various environmental, cell-cycle, and metabolic 
signals. A graphical representation of the network controlling the initiation of 
sporulation is shown in figure 7.6a, displaying key genes and their promoters, 
proteins encoded by the genes, and the regulatory action of the proteins. 

Figure 7.6 (a) Key genes, proteins, and regulatory interactions making up the
network involved in B. subtilis sporulation (de Jong et al., 2004a). (b) Path
in the transition graph produced by qualitative simulation of the response of a
B. subtilis cell to nutrient deprivation. The figure shows how the threshold
boundaries on the concentrations of $\sigma^F$, KinA, and Spo0A evolve as a
consequence of the successive region transitions. The concentration of
$\sigma^F$ transiently crosses the threshold above which this sigma factor
directs the transcription of genes essential for later stages of the
sporulation process.

The graphical representation of the network has been translated into a
piecewise-linear differential equation model of the network supplemented by
inequality constraints on the parameters. The resulting model consists of 11
variables and 48 parameters constrained by 70 parameter inequalities. The
choice of the latter is largely determined by biological data. Using this
model, the response of wild-type and mutant cells to nutrient deprivation has
been simulated by means of GNA. This has given rise to transition graphs
consisting of up to several hundreds of regions, most of which can be
simplified by eliminating regions that are instantaneously traversed and
therefore of limited biological interest. An example of a path in such a
transition graph is shown in figure 7.6b.

Analysis of the sporulation network by means of GNA has revealed that essential 
features of the initiation of sporulation in wild-type and mutant strains of B. subtilis 
are reproduced by the model (de Jong et al., 2004a). In particular, the choice 
between division and sporulation is seen to be determined by competing positive 
and negative feedback loops influencing the accumulation of the phosphorylated 
form of the transcription factor Spo0A. Above a certain threshold concentration, 
Spo0A~P activates various genes whose expression commits the bacterium to 
sporulation, such as genes encoding sigma factors that control the alternative 
developmental fates of the mother cell and the spore (figure 7.6b). Other examples 
of the application of GNA are the qualitative modeling and simulation of quorum 
sensing in the pathogenic bacterium Pseudomonas aeruginosa (Usseglio Viretta and 
Fussenegger, 2004) and the carbon starvation response in E. coli (Ropers et al., 
2005). 

In summary, the basic idea informing discrete abstractions of the dynamics of 
continuous systems is that they partition the phase space into regions in which the 
system behaves in a qualitatively homogeneous manner. The state of the system is 
henceforward described by the region in which it resides, and a change of state by a 
transition from one region to another. In comparison with the underlying continuous 
system, the use of discrete abstractions leads to a loss of quantitative precision. 
However, for many questions the abstract description is sufficiently informative and 
well-adapted to the available biological data. Moreover, transition graphs are easy 
to analyze and capture qualitative properties of the system that are invariant for 
moderate changes in parameter values. The transition graphs grow exponentially 
with the number of genes in the network, though, which limits the scalability of 
the approach. Tools for the formal verification of qualitative dynamical properties 
reduce this problem to some extent (Batt et al., 2005b; Bernot et al., 2004; Chabrier- 
Rivier et al., 2004), but cannot entirely avoid it. 





7.4 Boolean Networks 


Instead of discretizing the dynamics of a continuous model, one could also study 
qualitative properties of genetic regulatory networks by directly starting with a 
discrete model. The sigmoid shape of the Hill function in figure 7.2b suggests that, 
to a first approximation, a gene can be described as either active (on) or inactive 
(off). That is, if the gene is active (inactive), the protein it encodes is assumed 
present (absent) in the cell. The change in gene expression can be described by 
making the assumption that the change in activation state of a gene is determined 
in a combinatorial fashion by the activation state of other genes, in particular 
genes encoding regulatory proteins. The above intuitions have been formalized by 
Boolean networks, which have become popular in the wake of a groundbreaking
study by Kauffman (1969; see Kauffman (1993); Somogyi and Sniegoski (1996) for 
reviews). 

Let the vector $\hat{x} = (\hat{x}_1, \ldots, \hat{x}_n)' \in \{0, 1\}^n$ of
Boolean variables represent the state of a network of $n$ genes. Each
$\hat{x}_i$ has the value 0 (inactive) or 1 (active). The activation state
$\hat{x}_i$ of a gene at a discrete time-point $t + 1$ is determined by a
Boolean function $b_i : \{0, 1\}^n \to \{0, 1\}$, which defines
$\hat{x}_i(t + 1)$ in terms of $\hat{x}(t)$. Most of the time, $b_i$ will
effectively depend on the state of only $k_i$ of the $n$ genes. The variable
$\hat{x}_i$ is also referred to as the output of the gene and the $k_i$
variables from which it is computed the inputs. For $k_i$ inputs the total
number of possible Boolean functions $b_i$ mapping the inputs to the output is
$2^{2^{k_i}}$. This means that for $k_i = 2$ there are 16 possible functions,
the logical “AND” and the logical “OR” being two examples. In summary, the
dynamics of a genetic regulatory network can be described by a Boolean network
defined by the following equations:


$$\hat{x}_i(t + 1) = b_i(\hat{x}(t)), \quad i \in \{1, \ldots, n\}, \tag{7.4}$$


where $b_i$ maps $k_i$ inputs to an output value. The Boolean network
corresponding to the cross-inhibition network is shown in figure 7.7a. The
network is quite simple, consisting of two Boolean variables, $\hat{x}_a$ and
$\hat{x}_b$, each connected to the other variable by means of a logical “NOT”.
For illustrative purposes, an example of a slightly more complex Boolean
network (involving three variables, two inputs per variable, and various
logical functions) is shown in figure 7.7b.
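
The count of possible Boolean functions and the update rule of the cross-inhibition network can be made concrete in a few lines of Python. In this sketch a Boolean function is represented by its truth table, stored as a dictionary from input tuples to output values. This representation and the function names are choices made for the example.

```python
from itertools import product


def all_boolean_functions(k):
    """Enumerate every Boolean function of k inputs as a truth table."""
    inputs = list(product((0, 1), repeat=k))
    return [
        dict(zip(inputs, outputs))
        for outputs in product((0, 1), repeat=len(inputs))
    ]


def cross_inhibition_step(state):
    """One synchronous update: each gene is the logical NOT of the other."""
    xa, xb = state
    return (1 - xb, 1 - xa)


two_input_functions = all_boolean_functions(2)

# Truth tables of the logical AND and OR, two of the 16 functions.
AND = {(a, b): a & b for a, b in product((0, 1), repeat=2)}
OR = {(a, b): a | b for a, b in product((0, 1), repeat=2)}
```

The enumeration grows as $2^{2^k}$, so it is only practical for very small numbers of inputs.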

The dynamics of a Boolean network are conveniently represented by means of a
transition graph. Let $\hat{x}, \hat{x}' \in \{0, 1\}^n$ be two states of a
Boolean network with $n$ genes. There exists a transition from $\hat{x}$ to
$\hat{x}'$, denoted by $\hat{x} \to_s \hat{x}'$, if and only if
$\hat{x}'_i = b_i(\hat{x})$, for every $i \in \{1, \ldots, n\}$. Formally,
$\to_s \, \subseteq \{0, 1\}^n \times \{0, 1\}^n$. The subscript $s$ indicates
that the transitions are synchronous, that is, the states of all genes are
updated simultaneously. Notice that the transitions are deterministic, in the
sense that every state of the system has a single successor. The transition
graph can be formally defined as follows:


$$BTG = (\{0, 1\}^n, \to_s). \tag{7.5}$$


A sequence of states $\hat{\phi}$ in the transition graph is a path if and only
if $\hat{\phi} = (\hat{x}^0)$ or
$\hat{\phi} = (\hat{x}^0, \ldots, \hat{x}^m)$, $m > 0$, and for all
$i \in [0, \ldots, m - 1]$, we have $\hat{x}^i \to_s \hat{x}^{i+1}$.
Because the number of states in the state space of a Boolean network is finite, when 
extended every path will eventually reach an attractor, either a state having itself as 
a successor (point attractor) or a state cycle (cyclic attractor). The attractor states 
and the states leading to the attractor together constitute the basin of attraction 
of the attractor. For simple networks the attractors and their basins of attraction 
can be calculated by hand, but for larger systems the use of computer programs 
becomes inevitable, given that the size of the transition graphs scales exponentially 
with the number of genes (section 7.3). Examples of such computer programs are 
DDLab (Wuensche, 2003) and GINsim (Chaouiya et al., 2003). 
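
For a network as small as the cross-inhibition network, attractors and basins of attraction can be found by exhaustive enumeration. The following Python sketch follows the synchronous trajectory from every state until a state repeats. It is an illustration of the definitions above, not the method used by DDLab or GINsim, and the function names are hypothetical.

```python
from itertools import product


def cross_inhibition_step(state):
    """Synchronous update of the cross-inhibition network (mutual NOT)."""
    xa, xb = state
    return (1 - xb, 1 - xa)


def find_attractors(step, n):
    """Map each attractor (a frozenset of states) to its basin of attraction."""
    basins = {}
    for state in product((0, 1), repeat=n):
        visited = []
        current = state
        while current not in visited:       # follow the unique successor
            visited.append(current)
            current = step(current)
        # The states from the first repeated state onwards form the attractor.
        attractor = frozenset(visited[visited.index(current):])
        basins.setdefault(attractor, set()).add(state)
    return basins


basins = find_attractors(cross_inhibition_step, n=2)
for attractor, basin in basins.items():
    kind = "point" if len(attractor) == 1 else "cyclic"
    print(kind, sorted(attractor), "basin:", sorted(basin))
```

Because all $2^n$ states are visited, this brute-force approach is limited to small networks.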
Figure 7.7 (a) Boolean network model of the cross-inhibition network in
figure 7.1, in the form of an electronic circuit and a system of equations.
(b) Illustration of a more complex Boolean network model. $\hat{x}_a$,
$\hat{x}_b$, $\hat{x}_c$ are Boolean variables; NOT, OR, NOR, NAND are Boolean
functions.


The transition graph for the cross-inhibition network is shown in figure 7.8a. 
Because the network consists of two genes, we have a total of four states, denoted 
by 00, 01, 10, and 11. For instance, 01 means that gene a is off and gene b is on. The 
graph consists of three unconnected attractors: the point attractors 10 and 01, and 
a cyclic attractor consisting of 00 and 11. Notice that one of the proteins is present 
and the other absent in the point attractors, which allows these states to be related 
to the stable equilibrium points in the differential equation model (figure 7.3a). On 
the other hand, the cyclic attractor has no obvious counterpart in the differential 
equation model. Moreover, the hysteresis property of the cross-inhibition network 
is not preserved. When perturbing the activation state of one of the genes in a 
point attractor, that is, when randomly flipping the Boolean value of $\hat{x}_a$ or $\hat{x}_b$, the
system makes a transition to 00 or 11. From there it can neither return to its original 
state nor reach the other point attractor. This illustrates that there are situations 
in which the idealizations underlying Boolean networks are not appropriate, in 
the sense that the models cannot account for experimentally-observed dynamical 
properties. 

Figure 7.8 Transition graphs for the Boolean network corresponding to the
cross-inhibition network (figure 7.7a). (a) Transition graph for synchronous
transitions ($\to_s$). (b) Transition graph for asynchronous transitions
($\to_a$).

Several generalizations of the standard Boolean network formalism have been
proposed, based on assumptions that are more realistic from a biological point
of view. Instead of using synchronous transitions between states, one could
resort to asynchronous transitions, in which the activation state of only a
single gene is updated at a time. Formally, this amounts to replacing $\to_s$
by a new transition relation $\to_a \, \subseteq \{0, 1\}^n \times \{0, 1\}^n$,
where $\hat{x} \to_a \hat{x}'$ if and only if $\hat{x}'_i = b_i(\hat{x})$, for
some $i \in \{1, \ldots, n\}$, and $\hat{x}'_j = \hat{x}_j$, for all
$j \neq i$.² The use of $\to_a$ makes the Boolean network nondeterministic, in
the sense that a state may have up to $n$ successors. The asynchronous
transition graph for the cross-inhibition network is shown in figure 7.8b. As
can be immediately verified, both the bistability and the hysteresis property
are now reproduced. Asynchronous Boolean networks underlie the logical method
introduced by Thomas, who also proposes the use of multivalued instead of
Boolean variables, in order to distinguish multiple levels of gene expression
(Thomas, 1973; Thomas and d’Ari, 1990). Another generalization of the standard
formalism is the probabilistic Boolean network, which does not associate a
single Boolean function with a gene, but rather a probability distribution on
a set of Boolean functions, thus taking into account uncertainty in the state
transitions (Shmulevich et al., 2002a,b).
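
The claim that asynchronous updating recovers bistability and hysteresis for the cross-inhibition network can be checked with a short Python sketch. As a simplifying assumption of this example, updates that leave the state unchanged are not recorded as transitions, so a state without successors is a steady state. The function names are hypothetical.

```python
# Boolean functions of the cross-inhibition network: each gene is NOT the other.
FUNCTIONS = (
    lambda state: 1 - state[1],   # b_a depends on gene b
    lambda state: 1 - state[0],   # b_b depends on gene a
)


def async_successors(state, functions=FUNCTIONS):
    """States reached by updating exactly one gene whose value changes."""
    result = set()
    for i, b in enumerate(functions):
        new_value = b(state)
        if new_value != state[i]:
            result.add(state[:i] + (new_value,) + state[i + 1:])
    return result


def reachable(start, functions=FUNCTIONS):
    """All states reachable from `start` by asynchronous transitions."""
    seen = {start}
    stack = [start]
    while stack:
        for nxt in async_successors(stack.pop(), functions):
            if nxt not in seen:
                seen.add(nxt)
                stack.append(nxt)
    return seen


# Flipping gene a in the steady state 10 gives 00, from which 01 can be reached.
perturbed = (0, 0)
can_switch = (0, 1) in reachable(perturbed)
```

In this sketch, the states 10 and 01 have no successors, while 00 and 11 each lead to both of them. A perturbation of a steady state can therefore either return to it or end in the other steady state, which matches the hysteresis described in the text.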

Boolean network models and their generalizations have been able to give insights 
into the functioning of actual genetic regulatory networks, as demonstrated by 
studies of pattern formation in early Drosophila development (Albert and Othmer, 
2003; Sánchez et al., 1997; Sanchez and Thieffry, 2001, 2003), flower morphogenesis 
in Arabidopsis (Mendoza et al., 1999), and mucus production in Pseudomonas 
aeruginosa (Bernot et al., 2004). The results confirm the basic assumption of 
qualitative approaches that many important dynamical properties of an organism do 
not depend on specific reaction mechanisms or precise numerical values for kinetic 
constants, but are to a large extent determined by the structure of interactions of 
the network (chapter 2). 

More generally, standard Boolean networks have been a popular model for 
theoretical investigations of the relation between the structure and dynamics of 
genetic regulatory networks. The basic idea of this so-called ensemble approach, 
proposed by Kauffman (1993, 2004), is to consider the ensemble of Boolean networks 
sharing some structural properties, such as a particular number of inputs per 
gene or Boolean functions of a particular type. We can then randomly sample 
networks from the ensemble and provide statistics on their dynamical properties, 
such as the number of attractors and the size of their basins of attraction.
Under the above assumptions, it follows that the dynamical properties typically 
found for the sampled networks must be attributed to the structural properties 
defining the ensemble, and can hopefully be explained by the latter. The biological 
relevance of the ensemble approach is based on the hypothesis of Kauffman that 
real genetic regulatory networks belong to an ensemble whose defining properties 
remain to be discovered (2004). The aim is to identify this ensemble by the iterative 
generation of ensembles and the comparison of their typical dynamical properties 
with experimental data. The simplicity of standard Boolean networks makes them 
excellent models for doing the extensive computations required by the ensemble
approach.
ô Q Q 
AND OR XOR 
l dl i k i i 


Figure 7.9 Examples of (a)-(b) canalizing and (c) non-canalizing Boolean functions 
with inputs i; and iz, and output ô. In the case of the AND (OR) function, a value of 0(1) 
for one of the inputs forces the output to 0(1). In the case of the XOR function no such 
value for one of the inputs exists. 
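
The ensemble approach described above can be illustrated with a small sketch. The code below is not taken from the studies cited; it samples random Boolean networks with a fixed number of inputs per gene, updates them synchronously, and counts attractors and basin sizes by exhaustive enumeration of the state space. The function names, the network size (8 genes), the number of inputs per gene (2), and the sample size (200) are all illustrative assumptions.

```python
import itertools
import random


def random_network(n_genes, k_inputs, rng):
    """Sample one network: each gene gets k random inputs and a random truth table."""
    inputs = [rng.sample(range(n_genes), k_inputs) for _ in range(n_genes)]
    tables = [[rng.randint(0, 1) for _ in range(2 ** k_inputs)]
              for _ in range(n_genes)]
    return inputs, tables


def step(state, inputs, tables):
    """Synchronous update: every gene looks up its next value from its inputs."""
    nxt = []
    for gene_inputs, table in zip(inputs, tables):
        index = 0
        for g in gene_inputs:
            index = 2 * index + state[g]
        nxt.append(table[index])
    return tuple(nxt)


def attractors(n_genes, inputs, tables):
    """Return {attractor (frozenset of states): size of its basin of attraction}."""
    attractor_of = {}  # state -> attractor that the state eventually reaches
    basins = {}
    for start in itertools.product((0, 1), repeat=n_genes):
        path, seen_at, state = [], {}, start
        while state not in attractor_of and state not in seen_at:
            seen_at[state] = len(path)
            path.append(state)
            state = step(state, inputs, tables)
        if state in attractor_of:
            att = attractor_of[state]          # joined a known basin
        else:
            att = frozenset(path[seen_at[state]:])  # closed a new cycle
            basins[att] = 0
        for s in path:
            attractor_of[s] = att
        basins[att] += len(path)
    return basins


# Statistics over a (hypothetical) ensemble: 8 genes, 2 inputs per gene.
rng = random.Random(0)
counts = []
for _ in range(200):
    inputs, tables = random_network(8, 2, rng)
    counts.append(len(attractors(8, inputs, tables)))
print("mean number of attractors:", sum(counts) / len(counts))
```

Exhaustive enumeration is only feasible for small networks, since the number of states doubles with every gene; larger ensembles would require sampling of initial states instead.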


An interesting recent application of the ensemble approach is a study of the 
logical functions expected to play a role in gene regulation (Kauffman et al., 
2003b). A gene is regulated by many transcription factors, which may combine 
to yield a complex regulatory logic, as demonstrated by the analysis of the control 
of expression of the Endo16 gene in the sea urchin (Yuh et al., 1998). However, there 
is some evidence that one particular class of logical functions, so-called canalizing 
functions, is overrepresented in gene regulation (Harris et al., 2002). In terms 
of Boolean logic, a canalizing function has at least one input such that, for at 
least one value of this input, the output value is fixed to either 0 or 1 whatever 
the values of the remaining inputs (Kauffman, 1993) (figure 7.9). Kauffman et 
al. have generated random Boolean networks of thirty genes having a structure of 
interactions equal to that of the core of the yeast transcriptional regulatory network. 
The Boolean functions in the networks are chosen from either a distribution of all 
Boolean functions or a distribution of canalizing Boolean functions. The networks 
sampled from the two ensembles show different stability properties, that is, they 
tend to react differently to random perturbations of an initial state. In fact, networks 
with canalizing Boolean functions are on average more stable than networks with 
arbitrary Boolean functions, in the sense that in the former case the state after 
the perturbation remains closer to the initial state (Kauffman et al., 2003b, 2004). 
Since this stability or robustness is expected on biological grounds (chapter 2), the 
results could be taken as evidence that actual genetic regulatory networks belong 
to an ensemble of canalizing networks. 
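
The definition of a canalizing function can be checked mechanically. The following minimal sketch, with the hypothetical helper name `is_canalizing`, tests every input and every value of that input to see whether fixing it forces the output, and applies the test to the three functions of figure 7.9.

```python
from itertools import product


def is_canalizing(f, n_inputs):
    """Return True if some input, fixed to some value, forces the output of f."""
    for i in range(n_inputs):
        for value in (0, 1):
            # Collect the outputs over all settings of the other inputs.
            outputs = {
                f(*bits)
                for bits in product((0, 1), repeat=n_inputs)
                if bits[i] == value
            }
            if len(outputs) == 1:
                return True
    return False


def AND(a, b):
    return a & b


def OR(a, b):
    return a | b


def XOR(a, b):
    return a ^ b


print(is_canalizing(AND, 2), is_canalizing(OR, 2), is_canalizing(XOR, 2))
# True True False
```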

Contrary to the approach discussed in the previous section, Boolean networks 
do not have any intrinsic abstraction relation to an underlying continuous system. 
Standard Boolean networks describe the state of the system by a vector of Boolean 
values, indicating for each of the genes whether it is on or off. The state of 
the system evolves in discrete time, as a consequence of transitions that may 
change the activation state of one or more genes. The attractiveness of Boolean 
network models is based on the intuitiveness of the representation of gene regulation 
by means of Boolean functions and the simplicity of the algorithms used for 
computing the transition graphs. However, the classical formalisms make strong 
simplifying assumptions, in particular the use of binary values for gene activation 
and synchronous transitions. These assumptions are relaxed in the generalized 
formalisms mentioned, thus increasing the biological validity of the models, but 
at the price of losing some of the computational and mathematical simplicity of the 
standard approach. 





7.5 Graphs 


The previous section suggests another way to predict the behavior of genetic 
regulatory networks. If certain structural properties can be shown to imply specific 
dynamical properties, then the behavior of the system could be inferred, at least 
tentatively, by verifying whether the network possesses these structural properties. 
As shown in section 7.2, the cross-inhibition structure of the example network 
endows it with bistability and hysteresis properties for a large range of parameter 
values. Therefore, one might argue, identifying the cross-inhibition pattern in a 
network could provide us with a clue as to the dynamics of the system. This 
demands a study of the structural properties of genetic regulatory networks, for 
which graph models are well-suited. 

A graph is defined as a tuple (V, E), with V a set of vertices and E ⊆ V × V a 
set of edges (Berge, 2001): 


G = (V, E). (7.6) 


The edges represent the relation between vertices and may be directed or undirected. 
A directed edge is a pair (i, j) ∈ E of vertices, where i denotes the head and 
j the tail of the edge. If (i, j) is an undirected edge, the order of the vertices 
is of no importance. A genetic regulatory network can now be seen as a directed 
graph in which the vertices represent genes and the edges regulatory interactions. 
The edges are directed from regulating to regulated genes, from genes encoding 
transcription factors to the targets of the transcription factors. In order to express 
the nature of the regulatory interactions, we can label the edges. By defining a 
directed edge as a tuple (i, j, s) ∈ E, with s ∈ {+, −}, it can be indicated whether 
i is activated or inhibited by j. As an illustration, the graph corresponding to the 
cross-inhibition network is shown in figure 7.10a. This simple graph is composed of 
two vertices, a and b, as well as two directed edges. Figure 7.10b shows a slightly 
more complex example of a graph model, added for illustrative purposes. 


(a) 
V = {a, b} 
E = {(a, b, −), (b, a, −)} 

(b) 
V = {a, b, c} 
E = {(c, a, +), (·, ·, +), (a, c, −), 
(a, b, −), (b, a, −), (b, b, −)} 
[the vertices of the second edge in (b) are illegible in the source] 

Figure 7.10 Directed graphs representing (a) the cross-inhibition network in figure 7.1, 
and (b) a more complex network, added for comparison. 
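
As a minimal sketch of how such a labeled graph might be stored, the code below encodes the cross-inhibition network of panel (a) as a set of tuples (i, j, s), following the convention of the text that vertex i is regulated by vertex j with sign s. The helper names `regulators` and `degrees` are assumptions made for this illustration.

```python
# Edges are (i, j, s): vertex i (head) is regulated by vertex j (tail) with sign s.
V = {"a", "b"}
E = {("a", "b", "-"), ("b", "a", "-")}


def regulators(vertex, edges):
    """Return {regulator: sign} for all edges whose head is `vertex`."""
    return {j: s for (i, j, s) in edges if i == vertex}


def degrees(vertices, edges):
    """Return {vertex: (in_degree, out_degree)}."""
    result = {}
    for v in vertices:
        n_in = sum(1 for (i, j, s) in edges if i == v)   # edges arriving at v
        n_out = sum(1 for (i, j, s) in edges if j == v)  # edges leaving v
        result[v] = (n_in, n_out)
    return result


print(regulators("a", E))
print(degrees(V, E))
```

A dictionary keyed by regulator cannot hold two edges of opposite sign between the same pair of vertices; a list of pairs would be needed if such cases had to be represented.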


The representation of a genetic regulatory network as a graph allows the analysis 
of its structural properties by means of graph-theoretical techniques (Barabási and 
Oltvai, 2004; Newman, 2003). The global connectivity properties of the network 
can, for instance, be described by the average degree and the degree distribution 
of the vertices. The degree k of a vertex indicates the number of edges to which 
it is connected (if necessary, incoming and outgoing edges can be distinguished). 
⟨k⟩ denotes the average degree and P(k) the degree distribution of the graph. These 
properties give an indication of the complexity of the graph and allow different types 
of graphs, and therefore networks, to be distinguished (figure 7.11). In classical 
random graphs (figure 7.11a), also called Erdős-Rényi graphs, the probability that 
a given vertex has k edges follows a Poisson distribution P(k). That is, the vertices 
typically have ⟨k⟩ edges and vertices having significantly more or fewer edges than 
⟨k⟩ are extremely rare, as shown in part (c) of the figure. By contrast, in scale-free 
graphs (figure 7.11b), the vertex degrees obey a power-law distribution P(k) ∼ k^(−γ), 
shown in part (d) of the figure. Scale-free graphs are inhomogeneous, in that most 
of the vertices have few edges, whereas some vertices, called hubs, have many edges 
and hold the graph together. 
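
For reference, the two degree distributions contrasted above can be written explicitly. The Poisson form is the standard limit for large, sparse classical random graphs; the symbol $k_v$ for the degree of vertex $v$ is notation introduced here for the illustration.

```latex
\langle k \rangle = \frac{1}{|V|} \sum_{v \in V} k_v , \qquad
P_{\text{random}}(k) = e^{-\langle k \rangle} \, \frac{\langle k \rangle^{k}}{k!} , \qquad
P_{\text{scale-free}}(k) \sim k^{-\gamma} .
```

The Poisson distribution decays faster than exponentially for large k, which is why vertices with many more than ⟨k⟩ edges are practically absent from random graphs, whereas the slowly decaying power law leaves room for hubs.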

Figure 7.11 Schematic illustration of the architecture of (a) random and (b) scale-free 
undirected graphs (Bray, 2003). The degree distribution follows (c) a Poisson distribution 
in random graphs and (d) a power-law distribution in scale-free graphs. k denotes the 
degree of a vertex and P(k) the degree distribution. The filled vertices in (b) are hubs. 

For values of the degree exponent γ between 2 and 3, scale-free graphs have a 
number of surprising properties. First, the average length of the path between two 
vertices of the graph is proportional to log log |V|, where |V| denotes the number 
of vertices of the graph (Barabási and Oltvai, 2004; Newman, 2003). This is even 
shorter than the average path length in random graphs, which scales as log |V| and 
confers on them the small-world property (Watts and Strogatz, 1998). The small- 
world property implies that local perturbations can quickly spread out through the 
entire regulatory network. Second, the presence of hubs makes scale-free graphs 
robust against accidental failures (Albert et al., 2000; Jeong et al., 2000, 2001). 
Whereas randomly removing a certain number of vertices disintegrates a random 
graph, in a scale-free graph this mainly affects the numerous low-degree vertices, 
the absence of which does not decompose the graph. Third, unlike classical random 
graphs, scale-free graphs can possess a modular structure (Ravasz et al. (2002); 
chapter 3). Such graphs are constructed by iteratively combining small and tightly- 
clustered modules of vertices into a hierarchical structure. 

There is now quite some evidence that genetic regulatory networks, and many 
other biological and non-biological networks, are scale-free (Dobrin et al., 2004; 
Featherstone and Broadie, 2002; Guelzim et al., 2002; Jeong et al., 2000, 2001; Lee 
et al., 2002; Maslov and Sneppen, 2002; Tong et al., 2004; Wagner and Fell, 2001). 
Some caution is needed in interpreting these results, though. Because current 
data on regulatory interactions are incomplete, a subnetwork of the actual network 
is analyzed, which may have a different degree distribution (Stumpf et al., 2005). 
Moreover, the particular graph representation chosen to model the network may 
bias the results, as shown by Arita for the E. coli metabolic network (Arita, 2004). 
In the case of genetic regulatory networks, graph models are usually restricted to 
direct transcription regulation interactions, thus ignoring indirect interactions that 
are mediated by metabolites binding to transcriptional regulators (Alm and Arkin, 
2003). 

Although the analysis of structural properties like the degree distribution may 
yield insights that seem intuitively important for understanding the dynamics of 
a network, it is not so easy to actually pin down their behavioral consequences. 
Some studies have begun to explore the topic, using a combination of graph theory 
and Boolean networks (Aldana and Cluzel, 2003; Fox and Hill, 2001; Oosawa and 
Savageau, 2002), but the relation between global structural properties and network 
dynamics is still largely an open question (Strogatz, 2001). Alternatively, one could 
follow an approach focusing on local structural properties, in particular specific 
patterns of interactions between the network components. In this vein, Thomas 
has conjectured that positive feedback loops in regulatory networks are a necessary 
condition for the occurrence of multiple equilibrium points (Thomas and d’Ari, 
1990), a conjecture that has been proven since by a number of authors (Cinquin 
and Demongeot (2002); Gouzé (1998); Plahte et al. (1995); Snoussi (1998); Soulé 
(2003); see also Remy et al. (2003)). In the remainder of this section, we will 
discuss another example of the latter approach, the identification and functional 
analysis of motifs. 
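
To make the notion of a positive feedback loop concrete, the sketch below enumerates the simple cycles of a signed directed graph and computes the sign of each cycle as the product of the signs of its edges. It is an illustration written for this purpose, not the method of the cited authors; edges follow the (i, j, s) convention of the text, with i regulated by j, and the function name `feedback_loops` is an assumption.

```python
def feedback_loops(edges):
    """Enumerate simple cycles of a signed digraph given as (target, source, sign).

    Returns a list of (cycle, sign) pairs, where sign is +1 or -1, the product
    of the edge signs along the cycle. Each cycle is reported once, starting
    from its smallest vertex.
    """
    succ = {}
    for target, source, sign in edges:
        succ.setdefault(source, []).append((target, 1 if sign == "+" else -1))

    loops = []

    def extend(path, sign):
        for nxt, s in succ.get(path[-1], []):
            if nxt == path[0]:
                loops.append((tuple(path), sign * s))      # cycle closed
            elif nxt not in path and nxt > path[0]:
                extend(path + [nxt], sign * s)             # keep walking

    for start in sorted(succ):
        extend([start], 1)
    return loops


# Cross-inhibition network: two negative edges form one positive loop.
cross_inhibition = {("a", "b", "-"), ("b", "a", "-")}
print(feedback_loops(cross_inhibition))
# [(('a', 'b'), 1)]
```

An even number of inhibitions in a loop gives a positive sign, which is why the cross-inhibition network counts as a positive feedback loop.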

Loosely speaking, network motifs are recurring patterns of interactions between 
a small number of network components (Milo et al. (2002); Shen-Orr et al. (2002); 
see Wolf and Arkin (2003), for a review). Their functional importance has been 
suggested by the evolutionary conservation of motifs within the yeast protein- 
protein interaction network (Wuchty et al., 2003) and the convergent evolution 
towards the same motifs in the transcriptional regulatory network of diverse species 
(Conant and Wagner, 2003). 

Techniques for discovering motifs consist in the identification of small patterns 
in the graph that are overrepresented when compared to a randomized version of 
the same graph (Milo et al., 2002, 2004b). More precisely, all possible patterns of 
a fixed number of vertices occurring in the graph are enumerated in a first step. 
The statistical significance of a pattern is then inferred from the comparison of the 
original graph, corresponding to the biological network, with a set of randomized 
graphs, in which each vertex has the same number of incoming and outgoing edges 
as the corresponding vertex in the original graph. A pattern is a motif if it occurs 
significantly more often in the original graph than in the randomized graphs. Since 
randomized networks are supposed to be free of any type of natural selection, the 
overrepresentation of the motifs can be assumed to have an evolutionary origin, 
reflecting the importance of the function performed by the motif. This conclusion 
is sensitive, though, to the particular randomization procedure followed and requires 
careful statistical validation (Artzy-Randrup et al., 2004; Milo et al., 2004a). 
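
The comparison with randomized graphs can be sketched as follows for one particular pattern, the feedforward loop. This is a simplified illustration, not the procedure of Milo et al.: it counts a single pattern instead of enumerating all patterns, and it randomizes by repeatedly swapping the targets of two edges, which keeps the number of incoming and outgoing edges of every vertex unchanged. For brevity, edges are written here as (regulator, target) pairs, which differs from the (i, j) convention used earlier; the toy network and all function names are assumptions.

```python
import random


def count_ffl(edges):
    """Count feedforward loops: x -> y, y -> z and x -> z with x, y, z distinct."""
    succ = {}
    for x, y in edges:
        succ.setdefault(x, set()).add(y)
    count = 0
    for x, targets in succ.items():
        for y in targets:
            if y == x:
                continue
            for z in succ.get(y, ()):
                if z != x and z != y and z in targets:
                    count += 1
    return count


def randomize(edges, n_swaps, rng):
    """Degree-preserving randomization by swapping the targets of two edges."""
    edges = list(edges)
    present = set(edges)
    for _ in range(n_swaps):
        i, j = rng.randrange(len(edges)), rng.randrange(len(edges))
        (a, b), (c, d) = edges[i], edges[j]
        new1, new2 = (a, d), (c, b)
        # Skip swaps that would duplicate an edge or create a self-loop.
        if i == j or new1 in present or new2 in present or a == d or c == b:
            continue
        present -= {edges[i], edges[j]}
        present |= {new1, new2}
        edges[i], edges[j] = new1, new2
    return edges


def ffl_statistics(edges, n_random=100, seed=0):
    """Return (observed count, mean and standard deviation over randomized graphs)."""
    rng = random.Random(seed)
    observed = count_ffl(edges)
    counts = [count_ffl(randomize(edges, 10 * len(edges), rng))
              for _ in range(n_random)]
    mean = sum(counts) / n_random
    std = (sum((c - mean) ** 2 for c in counts) / n_random) ** 0.5
    return observed, mean, std


# Hypothetical toy network, used only to exercise the functions.
toy = [(0, 1), (1, 2), (0, 2), (2, 3), (3, 4), (4, 0), (1, 4), (0, 4)]
observed, mean, std = ffl_statistics(toy)
if std > 0:
    print("z-score:", (observed - mean) / std)
```

A large positive z-score would mark the pattern as a candidate motif, subject to the caveats about the randomization procedure mentioned above.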

Shen-Orr et al. have searched the transcriptional regulatory network of E. coli for 
motifs, using information from the database RegulonDB and the literature (Shen- 
Orr et al., 2002). In this network, consisting of 855 genes and 1,330 regulatory 
interactions, three overrepresented motifs have been identified: the feedforward 
loop, in which a transcription factor regulates a second transcription factor and 
both jointly regulate a target gene (figure 7.12a); the single-input motif, in 
which a group of genes is controlled by a single transcription factor; and the 
dense-overlapping regulons, in which genes and the transcription factors controlling 
their expression form a highly-overlapping structure. The feedforward loop is the 
motif occurring most frequently (40 times) in the E. coli network. This has been 
subsequently confirmed for an extended version of the same network, in which an 
even higher number of feedforward loops have been found (Ma et al., 2004a). 

What could be the functional role of the feedforward loop? In a follow-up 
study the group of Alon has theoretically and experimentally demonstrated the 
information processing task carried out by this motif. Using a differential equation 
model of the feedforward motif, they show that its role might be to filter out 
fluctuations in input stimuli and allow a rapid response when the stimuli disappear 
(Mangan et al., 2003; Mangan and Alon, 2003). Consider the feedforward loop in 
figure 7.12b, where the transcription factors X and Y together activate the gene z. 
When X is active and above a threshold concentration, the input signal activating 
X is transmitted to the output Z through a direct path from X and an indirect 
path from X through Y. Hence, a transient signal is not transmitted to Z, since 
it does not allow the concentration of Y to reach a threshold level high enough to 
stimulate the expression of gene z (figure 7.12c). On the other hand, a persistent 
input signal enables the concentration of Y to rise and eventually allows Z to pass 
its threshold level. The functioning of the feedforward loop is asymmetric, since the 
inactivation of X leads to the rapid downregulation of z. The above predictions have 
been experimentally verified for the L-arabinose utilization system in E. coli using 
reporter genes (Mangan et al., 2003). In this feedforward loop, CRP corresponds 
to the general transcription factor X and AraC to the specific transcription factor 
Y, while z is the operon araBAD. 
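
The filtering behavior described above can be reproduced with a deliberately crude simulation. The sketch below is not the differential equation model of Mangan et al.; it assumes that X is active exactly while the input signal is on, that regulation is a step function of the threshold, and that Y and Z share the same production and decay rates. All parameter values are hypothetical.

```python
def simulate_ffl(pulse_length, t_end=20.0, dt=0.001,
                 beta=1.0, alpha=1.0, theta_y=0.5):
    """Euler-integrate an AND-gated feedforward loop driven by one input pulse.

    Y is produced at rate beta while X is active; Z is produced at rate beta
    only while X is active AND y exceeds theta_y. Both decay at rate alpha.
    Returns the maximum concentration reached by Z.
    """
    y = z = z_max = 0.0
    for k in range(int(t_end / dt)):
        t = k * dt
        x_active = t < pulse_length            # input (and hence X) is on
        dy = (beta if x_active else 0.0) - alpha * y
        dz = (beta if (x_active and y > theta_y) else 0.0) - alpha * z
        y += dt * dy
        z += dt * dz
        z_max = max(z_max, z)
    return z_max


print("short pulse, max z:", simulate_ffl(0.5))
print("long pulse,  max z:", simulate_ffl(5.0))
```

With these assumed parameters the short pulse ends before Y reaches its threshold, so Z is never produced, while the long pulse lets Z accumulate. Production of Z stops as soon as X becomes inactive, which corresponds to the asymmetry noted in the text.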

The discussion of the feedforward motif illustrates how a clear, well-defined 
function can be assigned to a pattern of interactions that is overrepresented in the 
network. Unfortunately, it is not always possible to make such a straightforward 
connection between structure and function. Usually, motifs do not occur in isolation, 
but rather overlap to generate complex motif clusters (Dobrin et al., 2004). This 
makes it difficult to draw definite conclusions on the function of an individual 
pattern of interactions occurring in a cluster. For instance, it is not obvious that 
the network in figure 7.10b, in which the cross-inhibition pattern is embedded in 
a more complex feedback structure, also possesses the bistability and hysteresis 
properties for a large range of parameter values. In order to establish this, the 
static graph analysis needs to be complemented by a dynamic analysis of the type 
discussed in earlier sections of this chapter. 

Figure 7.12 (a) Feedforward loop motif in graph representation. (b) Feedforward loop 
in genetic regulatory network, where it is assumed that both X and Y are necessary for 
expression of z. (c) Dynamic properties of the feedforward loop (Shen-Orr et al., 2002). 
x_x, x_y, and x_z denote the concentrations of X, Y, and Z, respectively, and θ_x, θ_y, and θ_z 
their threshold levels. The input signal activates X. 





7.6 Discussion 


In order to understand how the functions and development of living organisms 
are controlled by the networks of interactions between genes, proteins, and small 
molecules within and between cells, we need mathematical methods and computer 
tools. We have emphasized the need for qualitative approaches for the modeling, 
analysis, and simulation of genetic regulatory networks, that is, approaches capable 
of inferring properties of the dynamics of genetic regulatory networks that are 
invariant for a range of reaction mechanisms and values of kinetic constants (section 
7.2). The interest in these qualitative approaches derives from the fact that, 
for most networks of biological interest, we do not have detailed information 
on the reaction mechanisms or precise numerical values for the kinetic constants. 
Moreover, it is reasonable to assume that many dynamical properties of living 
organisms are robust to at least some variations in mechanisms and numerical values. 
This does not mean that qualitative approaches are always the method of choice: there 
are biological questions for which quantitative precision is required, and there do 
exist systems for which detailed, quantitative information is available. Quantitative 
and qualitative approaches should be seen as complementary rather than mutually 
exclusive. 

In this chapter we have reviewed three approaches for the analysis of qualitative 
properties of the dynamics of genetic regulatory networks, based on 
increasingly abstract modeling formalisms: discrete abstractions of differential 
equations, Boolean networks, and graphs. Whereas the first two approaches use 
models that explicitly describe the dynamics of the system, the third approach is 
based on the assumption that an analysis of the structure of the system provides 
useful insights into its dynamics. The structural and dynamic approaches focus on 
distinct, but complementary aspects of the networks, and in practice need to be 
applied in combination. As discussed in section 7.5, the assignment of a function to a 
network motif or module requires tools for studying the network dynamics. On the 
other hand, tools for analyzing the network structure are critical for dealing with 
the problem that the transition graphs generated by the qualitative approaches 
scale exponentially with the size of the network. Instead of studying the dynamics 
of very large networks directly, it seems more judicious to distinguish network 
modules, study the dynamics of these modules individually, and then analyze the 
interactions between the modules on a higher level of abstraction, using simplified 
models for each of the modules (chapter 3). 

What are the main future directions for research on qualitative approaches 
towards the analysis of genetic regulatory networks? Of the many challenges that 
could be mentioned, two deserve special attention in our view. The first concerns 
the impact of qualitative modeling of genetic regulatory networks on experimental 
biology. The qualitative approaches have some features that make them particularly 
suitable for the systems studied at the forefront of experimental research, notably 
the ability to deal with incomplete and non-quantitative information. However, 
while many excellent qualitative models have been developed and described in the 
literature, examples of the experimental verification of novel predictions made by 
these models are still relatively rare. 

A second challenge is the development of qualitative methods that allow the 
integrated analysis of genetic regulatory networks and other types of molecular 
interaction networks, such as metabolic and signal transduction networks. In this 
chapter we have focused on the interactions occurring in gene regulation, but some 
of the methods could be applied or extended to the modeling of other types of 
interactions, such as enzyme-catalyzed reactions or protein-protein interactions. 
The search for motifs in a network composed of transcription regulation and 
protein-protein interactions in yeast is a good example (Yeger-Lotem et al., 2004). 
Alternatively, hybrid approaches could be followed, in which methods adapted to 
the specific problems of each type of network are combined. The use of Boolean 
network models to add the effects of gene regulation to flux balance models of E. 
coli metabolism is a case in point (Covert et al. (2001, 2004); see also chapter 5). 
Both directions are promising but have been little explored thus far. 


Notes 


1. The resulting partition does not actually preserve the derivative sign pattern, 
as in figure 7.5b, but a finer-grained partition can be formulated for which this is 
the case (Batt et al., 2005b). 

2. Mixtures of synchronous and asynchronous transitions can also be proposed. 
Such transition relations allow some, but not necessarily all, genes to change their 
activity state simultaneously. 

3. The cross-inhibition network in figure 7.1 is an example of a network with a 
positive feedback loop, due to the fact that each protein positively influences the 
expression of its own gene, by inhibiting the expression of the gene encoding the 
inhibitor of its own gene. 


8 Stochastic Modeling of Intracellular Kinetics 


Johan Paulsson and Johan Elf 


Cellular events are triggered by random collisions between molecules. If each type 
of event occurred numerous times per generation, this randomness could possibly 
average out and cells could behave deterministically. But many central cellular 
reactions by contrast occur so infrequently that substantial relative fluctuations 
arise spontaneously. By affecting the rates of other reactions, these fluctuations 
can propagate through networks and spread to any cellular process. The tendencies 
to correct fluctuations also range from strong to insignificant depending on the 
kinetic mechanisms, causing some systems to behave with high precision and 
others to accumulate extreme variability. Many aspects of life in the individual 
cell are therefore best understood probabilistically. This is further supported by 
a rapidly growing body of experimental work. Most macromolecules are found to 
be present in very low numbers per chemical species, and studies measuring single 
cell concentrations almost invariably report large variation from cell to cell. This 
chapter introduces some theoretical aspects of randomness in simple genetic and 
metabolic networks, including both general mathematical techniques and specific 
biological phenomena. 





8.1 Chapter Overview 


The text is organized as follows: section 8.2 discusses the assumptions behind 
stochastic modeling of chemical reactions. Section 8.3 presents a multivariate model 
for stochastic gene expression that can be solved exactly. Section 8.4 gives an 
interpretation of the fluctuation-dissipation theorem (FDT), tailored to biochemical 
processes. Section 8.5 uses simulations and FDT approximations for systems that 
operate near critical points, and section 8.6 shows how such fluctuations can be 
tamed by negative feedback. Section 8.7, finally, gives some examples of noise- 
induced transitions and constructive roles of noise. For Monte Carlo methods we 
refer to chapter 9 and chapter 16 for spatial and nonspatial descriptions respectively. 
For more extensive theoretical treatments of chemical fluctuations we refer to the 
several textbooks available (Erdi and Toth, 1989; Gardiner, 1985; Keizer, 1987; van 
Kampen, 1992). 





8.2 Basic Models for Stochastic Kinetics 


8.2.1 Differential Equations for Probabilities, Averages, and Macroscopic 
Concentrations 


A starting point for stochastic descriptions of chemical reactions is to define a 
sufficiently complete set of state variables such that changes only depend on the 
current state (Lax, 1960). This could in principle include continuous variables, such 
as temperature, cell age, or volume, but to simplify the notation we here only 
account for discrete jumps corresponding to changes in the number of molecules of 
each species. 

Consider $i_{\max}$ different chemical species homogeneously distributed¹ in a volume 
$\Omega$. The state of the system is defined by the state vector $\mathbf{n} = [n_1 \cdots n_{i_{\max}}]^T$, where $n_i$ 
is the number of molecules of species $i$. Let there be $j_{\max}$ types of reactions and 
let reaction $j$ change component $i$ from $n_i$ to $n_i + \nu_{ij}$ with a rate $r_j$ that depends 
only on the current state of the system, $\mathbf{n}$. The probability that reaction $j$ occurs 
in a small time interval $\Delta t$ is then $r_j(\mathbf{n})\Delta t$. The integers $\nu_{ij}$ form an $i_{\max} \times j_{\max}$ 
stoichiometric matrix $\boldsymbol{\nu}$ where the $j$th column $\boldsymbol{\nu}_j$ corresponds to the change in the 
state vector when a reaction of type $j$ occurs. 

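The probability of arriving in state $\mathbf{n}$ during a short time interval $\Delta t$ is the sum of 
the probabilities for leaving from other states $\mathbf{n} - \boldsymbol{\nu}_j$ to state $\mathbf{n}$ in a single reaction, 
$\Delta t \sum_j r_j(\mathbf{n} - \boldsymbol{\nu}_j) P(\mathbf{n} - \boldsymbol{\nu}_j, t)$. Similarly, the probability of leaving state $\mathbf{n}$ in this 
time interval is $\Delta t \sum_j r_j(\mathbf{n}) P(\mathbf{n}, t)$. The probability $P(\mathbf{n}, t + \Delta t)$ to be in state $\mathbf{n}$ 
at time $t + \Delta t$ is thus: 

$$
P(\mathbf{n}, t + \Delta t) = P(\mathbf{n}, t)
- \underbrace{\Delta t \sum_j r_j(\mathbf{n}) P(\mathbf{n}, t)}_{\text{leaving}}
+ \underbrace{\Delta t \sum_j r_j(\mathbf{n} - \boldsymbol{\nu}_j) P(\mathbf{n} - \boldsymbol{\nu}_j, t)}_{\text{arriving}}
\tag{8.1}
$$
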
Rearranging equation 8.1, dividing by $\Delta t$ and taking the continuous-time limit $\Delta t \to 0$ 
leads to a time-continuous state-discrete Markov process, described by the master equation for 
the system of chemical reactions (Singer, 1953; van Kampen, 1992): 

$$
\frac{dP(\mathbf{n}, t)}{dt} = \sum_{j=1}^{j_{\max}} \Big( r_j(\mathbf{n} - \boldsymbol{\nu}_j) P(\mathbf{n} - \boldsymbol{\nu}_j, t) - r_j(\mathbf{n}) P(\mathbf{n}, t) \Big) \tag{8.2}
$$

The motion of the averages $\langle n_i \rangle$ can be postulated directly from the rates and event 
sizes: 

$$
\frac{d\langle n_i \rangle}{dt} = \sum_{j=1}^{j_{\max}} \nu_{ij} \langle r_j(\mathbf{n}) \rangle \tag{8.3}
$$





The corresponding macroscopic concentrations $x_i$ are the number of molecules per 
unit volume in an infinitely large system, that is, a system where the rates and 
the volume go to infinity in such a way that $x_i$ converges. For many processes this 
could be at least hypothetically achieved by taking an infinite population of cells, 
removing walls and membranes, and keeping the remaining cell components well 
mixed, similar to some in vitro experiments. In practice, however, rate equations 
for $dx_i/dt$ are typically constructed directly from first principles and often used to 
approximate average cell behavior. For nonlinear systems it must then be remembered 
that there are qualitative differences between true averages and their macroscopic 
idealizations (Bharucha-Reid, 1960; Renyi, 1954), something we will return to in 
section 8.7. 
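
As a worked illustration of equations 8.2 and 8.3, consider a single species that is synthesized at a constant rate and degraded in a first-order reaction. The example and its symbols are assumptions introduced here: $k$ is the synthesis rate and $\beta$ the degradation rate constant, so that $r_1 = k$ with $\nu_1 = +1$ and $r_2 = \beta n$ with $\nu_2 = -1$.

```latex
\frac{dP(n,t)}{dt} = k\,P(n-1,t) + \beta\,(n+1)\,P(n+1,t) - (k + \beta n)\,P(n,t),
\qquad
\frac{d\langle n \rangle}{dt} = k - \beta \langle n \rangle .
```

Because both rates are linear in $n$, the equation for the average is closed and coincides with the macroscopic rate equation. For a nonlinear rate this is no longer the case: with a hypothetical rate $r(n) = c\,n(n-1)$, the average rate is $\langle r \rangle = c\,(\langle n^2 \rangle - \langle n \rangle)$, which in general differs from $c\,\langle n \rangle(\langle n \rangle - 1)$ because it depends on the variance.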


8.2.2 Simulating Paths of the Master Equation 


From the analysis above equation 8.1 we saw that the probability that some reaction 
will occur in a short time interval $\Delta t$ is the sum of the individual reaction probabilities, 
$\Delta t \sum_j r_j(\mathbf{n}) = \Delta t\, r_0$. Let $p(t)$ be the probability that the system has not left state 
$\mathbf{n}$ at time $t$, given that it was in state $\mathbf{n}$ at time $t = 0$. The change in $p(t)$ between 
time $t$ and $t + \Delta t$ is then $\Delta p(t) = p(t + \Delta t) - p(t) = -r_0\, p(t) \Delta t$. Taking the 
limit of continuous time, $\Delta t \to 0$, and solving the resulting differential equation 
$dp(t)/dt = -r_0\, p(t)$, gives $p(t) = \exp(-r_0 t)$. The probability that a reaction has 
occurred by time $t$ is thus $F(t) = 1 - p(t) = 1 - \exp(-r_0 t)$, that is, the system 
resides an exponentially distributed time in state $\mathbf{n}$, with an average $\langle t \rangle = 1/r_0$. 
The probability that the first event is reaction $j$ is in turn given by its relative 
contribution to the total rate, $\Pr(\text{reaction } j \mid \text{any reaction}) = r_j(\mathbf{n})/r_0$. 

This defines a simple algorithm for generating individual paths of the random 
process: pick the next reaction time from an exponential distribution and choose 
event type according to the fractional rates. Physical or chemical considerations are 
then only important when choosing what states and jumps to include to make the 
description Markovian. The algorithm itself is indistinguishable from the definition 
of a homogeneous time-continuous state-discrete Markov process (Doob, 1945). 
Daniel Gillespie, who effectively pioneered its use for generating sample paths of 
chemical reaction networks (Cao et al., 2004b; Gillespie, 1976, 1977), refers to it 
as the stochastic simulation algorithm in chapter 16, but here we will call it the 
Gillespie-Doob algorithm. 
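
The two steps of the algorithm, drawing an exponentially distributed waiting time and choosing the reaction according to its fractional rate, can be sketched in a few lines. This is a minimal illustration under stated assumptions, not an optimized implementation: the function name `gillespie`, the data layout (one list of state changes per reaction, that is, the columns of the stoichiometric matrix), and the birth-death example with its rate values are all hypothetical.

```python
import random


def gillespie(n0, stoich, rates, t_end, rng):
    """Generate one sample path of the jump process.

    n0     : initial state, a list of molecule numbers
    stoich : stoich[j][i] is the change in species i when reaction j fires
    rates  : function mapping a state to the list of rates r_j(n)
    Returns a list of (time, state) pairs, one per reaction event.
    """
    t, n = 0.0, list(n0)
    path = [(t, tuple(n))]
    while True:
        r = rates(n)
        r0 = sum(r)
        if r0 <= 0.0:
            break                       # no reaction can occur any more
        t += rng.expovariate(r0)        # waiting time with mean 1 / r0
        if t > t_end:
            break
        # Choose reaction j with probability r_j / r0.
        u = rng.random() * r0
        j, acc = 0, r[0]
        while acc <= u and j < len(r) - 1:
            j += 1
            acc += r[j]
        n = [ni + v for ni, v in zip(n, stoich[j])]
        path.append((t, tuple(n)))
    return path


# Hypothetical birth-death process: synthesis at rate 10, degradation at rate 1 * n.
rng = random.Random(42)
path = gillespie([0], [[1], [-1]], lambda n: [10.0, 1.0 * n[0]], 200.0, rng)
print("number of events:", len(path) - 1)
```

The cost of this direct version grows with the number of reaction events, so systems with very fast reactions may call for the approximate methods discussed elsewhere in the book.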


8.2.3 Elementary Reactions 


All kinetic modeling relies on condensing fast transitions between short-lived physical 
states into single reaction steps. The most common example is the approximation 
that a system is “well-stirred,” where bimolecular reactions are described without 
accounting for spatial positions. This essentially assumes that the molecules involved 
in a chemical reaction have time to diffuse over the whole volume before 
they are likely to be involved in a reaction again (Gardiner and Steyn-Ross, 1984). 
Descriptions of unimolecular reactions may similarly assume rapid internal equilibrations, 
so that transitions to functionally different states again effectively behave 
as if they had no memory. Because these assumptions are so common and more or 
less similar from system to system, they are often considered “elemental.” However, 
the key assumption that the time-scales are separated can be equally true for more 
complicated chemical reactions. For example, transcription involves an enormous 
number of small steps, yet for many purposes the whole process of making RNA 
molecules could possibly be approximated as Poissonian, that is, with exponentially 
distributed dwell times between births of new RNA molecules. 

Complicated reactions that can be represented by a single step are said to be 
“elementary complex”: elementary because they effectively behave as elemental 
reactions on the time-scale studied, and complex because they could be broken 
down into several more elemental reactions (Keizer, 1987). For example, consider a 
protein that rapidly equilibrates between two conformations 

$$
A_1 \underset{\lambda_2}{\overset{\lambda_1}{\rightleftharpoons}} A_2
$$

such that it is in conformation $A_2$ during a fraction $p = \lambda_1/(\lambda_1 + \lambda_2)$ of the time. If 
the $A_2$ conformation participates in another reaction $A_2 \xrightarrow{\lambda_3} B$ which occurs on a 
much slower timescale than the conformational changes in A, then this reaction can 
be considered elementary complex with rate $\lambda_3 p\, n_A$, where $n_A$ is the total number 
of A molecules. For the purpose of modeling changes in B it is then not necessary 
to include the two different conformations of A. 

In some cases it is important to also account for the fact that molecules involved 
in intermediate states of complex reactions cannot participate in other reactions. 
For example, let a protein autorepress its own transcription such that active genes 
are repressed with rate $\lambda_2 n$ and inactive genes are derepressed with rate $\lambda_1$. Each 
gene then switches on and off as 

$$
\text{on} \underset{\lambda_1}{\overset{\lambda_2 n}{\rightleftharpoons}} \text{off}
$$

If these reactions equilibrate rapidly compared to the other reactions, it is again 
tempting to simply assume that genes are active for a fraction $p = \lambda_1/(\lambda_1 + \lambda_2 n)$ 
of the time, ignoring the details of binding and unbinding. However, it may also 
be important to account for the fact that bound repressors are unavailable for 
other reactions, which can have large and qualitative effects on the dynamics (Berg 
et al., 2000b). That being said, it should also be emphasized that simplifications 
of this type are often necessary to make the models tractable, and they should 
not be avoided out of a superstitious fear of condensations. Only allowing uni- or 
bimolecular reactions can easily give an unwarranted impression of legitimacy: even 
in the simplest and best characterized cellular systems, it is simply not the case that 
we can make accurate quantitative assumptions with any confidence. When running 
the substantial risk of leaving out critical variables and reactions, the possibility of 
explicit simplifications that greatly facilitate modeling is one of the few blessings 
we can count. Increasing the state space will make a system less transparent and 
its dynamics less intelligible, which in turn makes it more difficult to identify 
gross inaccuracies in the assumptions. Furthermore, just like seemingly complicated 
reactions can behave as if they were elemental, seemingly simple reactions can hide 
a non-trivial behavior. For example, first-order unimolecular reactions are often 
automatically assumed to be elemental, though in many cases there are long-lived 
intermediate states such that the transitions are not memoryless (Xie, 2002;
Yang et al., 2003). 


8.2.4 Brief History of Stochastic Modeling of Chemical Reactions 


Stochastic modeling of chemical reactions has come a full century from the first 
studies of Brownian motion (Bachelier, 1900; Einstein, 1905). Models of fluctuating 
concentrations in turn date back to the 1930s (Leontovich, 1935) and were soon fol- 
lowed by biological applications. Several theoretical studies in the 1940s emphasized 
the intracellular randomness associated with small numbers: The “$1/\sqrt{N}$-rule” of
relative fluctuations at equilibrium (Schrödinger, 1944) influenced generations of biologists,
and autocatalysis was shown to further amplify variation (Delbrück, 1940).
The 1950s saw the first experimental analyses of heterogeneity in bacterial gene ex- 
pression (Benzer, 1953), and after a decade of focusing more on stochastic enzyme 
kinetics (Bartholomay, 1962), theory for stochastic gene expression was developed 
in some detail (Berg, 1978a; Rigney, 1979a,c; Rigney and Schieve, 1977). Both the- 
oretical and experimental efforts intensified in the 1980s and 90s, but it is only in 
the last five or ten years that the field has truly taken off. This is largely due to 
the possibility of systematic quantitative studies of protein fluctuations using green 
fluorescent protein (GFP) (Elowitz et al., 2002; Ozbudak et al., 2002), but also to 
a wider appreciation of the stochastic foundation of kinetic theory. 





8.3 Stochastic Gene Expression 


Gene expression is stochastic by nature: genes are activated and inactivated by 
random association and dissociation of repressors or transcription factors to DNA, 
transcription of a specific gene often occurs a few times per cell cycle, and many pro-

In addition to spontaneous Poisson fluctuations (first term), proteins are also
randomized by mRNA fluctuations (second and third terms), that follow:

$$\frac{\sigma_2^2}{\langle n_2\rangle^2} = \frac{1}{\langle n_2\rangle} + \frac{1-P_{on}}{\langle n_1\rangle}\,\frac{\tau_1}{\tau_1+\tau_2} \qquad (8.6)$$

The first term of equation 8.6 again reflects Poisson fluctuations, now coming from 
the random births and deaths of individual transcripts. The first factor of the 
second term can in turn be interpreted as the normalized stationary variance in the 
(binomially distributed) number of active genes:


$$\frac{\sigma_1^2}{\langle n_1\rangle^2} = \frac{1-P_{on}}{\langle n_1\rangle} \qquad (8.7)$$


where $P_{on} = \lambda_1^+/(\lambda_1^- + \lambda_1^+)$ is the stationary probability that a given gene is on. At
any given average, stationary fluctuations are smaller than Poissonian because the 
total number of genes is fixed. The second factor of the second term of equation 8.6 
comes from time-averaging and can be explained by solving the second average 
equation in equation 8.4 for fixed $\langle n_1\rangle$:


$$\underbrace{\langle n_2\rangle_{t_1+t_2} - \langle n_2\rangle_\infty}_{\text{deviation from stationary average at time } t = t_1 + t_2} = \underbrace{\left(\langle n_2\rangle_{t_1} - \langle n_2\rangle_\infty\right)}_{\text{deviation from stationary average at time } t = t_1} e^{-t_2/\tau_2} \qquad (8.8)$$


This means that $n_2$ exponentially forgets its initial conditions with rate $1/\tau_2$, that
is, events that occurred more than $\tau_2$ time units ago are almost forgotten. The same
principles apply to $n_1$, and the ratio $\tau_2/\tau_1$ thus determines how much the number
of active genes changes within the kinetic memory of the mRNA concentration. If
$\tau_2/\tau_1$ is large, the time-averaging factor in equation 8.6 is close to zero, reducing
mRNA fluctuations just like throwing many dice reduces relative fluctuations in
the total outcome. These principles also apply to proteins in equation 8.5. The
second term comes from time-averaged spontaneous mRNA fluctuations, and the
third term comes from low-copy gene fluctuations that are first time-averaged by
mRNAs and then by proteins.
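
As a rough numerical illustration of the time-averaging argument, the sketch below evaluates the two mRNA noise terms. It assumes that equation 8.6 has the form $\sigma_2^2/\langle n_2\rangle^2 = 1/\langle n_2\rangle + ((1-P_{on})/\langle n_1\rangle)\,\tau_1/(\tau_1+\tau_2)$; the function name and all parameter values are hypothetical and chosen only for illustration.

```python
def mrna_noise(mean_active_genes, mean_mrna, p_on, tau_gene, tau_mrna):
    """Return the two terms of the normalized mRNA variance (sigma^2 / mean^2).

    First term: Poisson noise from random births and deaths of transcripts.
    Second term: gene-switching noise, reduced by the time-averaging factor.
    """
    poisson = 1.0 / mean_mrna
    gene_switching = (1.0 - p_on) / mean_active_genes
    time_averaging = tau_gene / (tau_gene + tau_mrna)
    return poisson, gene_switching * time_averaging


# Hypothetical numbers: a single gene copy that is on 20% of the time.
# Long-lived mRNA (tau_mrna >> tau_gene) averages over many switching events.
poisson_term, slow_mrna_term = mrna_noise(0.2, 10.0, 0.2, tau_gene=1.0, tau_mrna=5.0)
# Short-lived mRNA follows the gene state closely and inherits more of its noise.
_, fast_mrna_term = mrna_noise(0.2, 10.0, 0.2, tau_gene=1.0, tau_mrna=0.2)

print(poisson_term, slow_mrna_term, fast_mrna_term)
```

With these made-up values the gene-switching contribution is smaller for the long-lived mRNA, which is the dice-throwing effect described above.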

Many processes including fluctuations in the protein synthesis machinery (Elf and 
Ehrenberg, 2005a), feedback regulation of transcription and translation (Becskei 
and Serrano, 2000; Swain, 2004; Tomioka et al., 2004; Elf and Ehrenberg, 2005b), 
localization of transcription factors, as well as controlled transport, maturation, 
folding, and degradation of mRNA and protein could similarly affect the rate of gene 
expression and fluctuations in protein concentrations. To make the results accessible 
in the current space, these effects are ignored above by implicitly absorbing all other 
processes into effective rate constants. 

For readers interested in the original literature, we recommend the pioneering and 
excellent papers by David Rigney, Otto Berg and colleagues (Berg, 1978a; Rigney, 
1979b,a,c; Rigney and Schieve, 1977) as well as the numerous other studies that have 
appeared since (Kepler and Elston, 2001; Paulsson, 2004; Paulsson and Ehrenberg,
2001; Peccoud and Ycart, 1995; Raser and O’Shea, 2004; Sasai and Wolynes, 2003;
Swain et al., 2002; Tapaswi et al., 1987; Thattai and van Oudenaarden, 2001). 





8.4 Fluctuation-Dissipation Approximations 


Most models of stochastic reaction networks include nonlinear rates and can there- 
fore rarely be solved exactly. They can still be simulated using the Gillespie-Doob 
algorithm described in section 8.2 and chapter 16, but each numerical simulation 
only shows the behavior of a single trajectory for a single combination of pa- 
rameters. Simulations are therefore easier to evaluate if they are complemented 
by more generic approximations. To exemplify straightforward interpretations of 
generic approximations, we here discuss a nonequilibrium version of the fluctuation- 
dissipation theorem (Keizer, 1987; Lax, 1960) (FDT). This states that the matrix 
of covariances $\sigma$ (with notation $\sigma_{ii} = \sigma_i^2$) follows:


$$\frac{d\sigma}{dt} = A\sigma + \sigma A^T + \langle B\rangle \qquad (8.9)$$


where “drift” matrix A reflects the dynamics for relaxation to steady state and 
“diffusion” B reflects the randomness of the individual events. This equation is used 
under different names in many areas of study and it can be derived in many ways 
(Elf and Ehrenberg, 2003; Gardiner, 1985; Keizer, 1987; Lax, 1960; van Kampen, 
1992). However, it is always assumed, explicitly or implicitly, that the responses in 
the reaction rates can be linearized in the parts of state space that are reached by 
fluctuations. 

To define A, let $J_i^{tot}(n) = \sum_j v_{ij} r_j(n)$ be the total flux of component i at state n
where reaction number j occurs with rate $r_j$, producing $v_{ij}$ molecules of species i.
The averages then exactly follow


$$\frac{d\langle n_i\rangle}{dt} = \langle J_i^{tot}\rangle = \langle J_i^+\rangle - \langle J_i^-\rangle \qquad (8.10)$$


where $J_i^+$ and $J_i^-$ are the total production and elimination fluxes of species i. The
covariance matrix in turn exactly follows


$$\frac{d\sigma}{dt} = \langle J^{tot} x^T\rangle + \langle x\,(J^{tot})^T\rangle + \langle B\rangle \qquad (8.11)$$


where $x_i = n_i - \langle n_i\rangle$ is the displacement from the average and $B_{ik} = \sum_j v_{ij} v_{kj} r_j(n)$
(see below). The FDT formulation in equation 8.9 follows from equation 8.11 by
approximating the flux J as linear in n, that is, by Taylor expanding the rates
around the average. That is not a trivial procedure though. When J depends
nonlinearly on n, the equations for the average dynamics cannot even be solved
exactly for the steady state because $\langle J(n)\rangle \neq J(\langle n\rangle)$. One approach (van Kampen,
1961) solves this problem by approximating fluctuations close to the macroscopic
limit where numbers are large and fluctuations are small, that is, starting with
rate equations. A related but slightly different approach is to simply make the
direct mean-field approximation $\langle J(n)\rangle \approx J(\langle n\rangle)$ that the average rate is the rate
at the average number. The difference between the two approaches is clear when
approximating the average rate of bimolecular homodimerization, $r^- = \lambda n(n-1)$:


$$\langle r^-\rangle = \lambda\langle n(n-1)\rangle = \lambda\left(\langle n\rangle^2 + \sigma^2 - \langle n\rangle\right) \approx \begin{cases} \lambda\left(\langle n\rangle^2 - \langle n\rangle\right) & \text{mean-field approx.} \\ \lambda\langle n\rangle^2 & \text{macroscopic approx.} \end{cases} \qquad (8.12)$$


Both ignore the variance, but only the latter additionally assumes high numbers.
Using either method in equation 8.12, the dynamic matrix A can then be calculated
as the Jacobian matrix of the average dynamics:


$$A_{ik} = \frac{\partial J_i^+(\langle n\rangle)}{\partial\langle n_k\rangle} - \frac{\partial J_i^-(\langle n\rangle)}{\partial\langle n_k\rangle} \qquad (8.13)$$


This is a measure for how the fluxes are affected by changes in n and thus
summarizes, in an approximate and local way, how fluctuations are amplified or
corrected. Neither the mean-field nor the macroscopic method necessarily provides a
good approximation, though, and when applied to specific nonlinear schemes they
should always be checked numerically.
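
A numerical check of this kind can be sketched for the homodimerization rate $r^- = \lambda n(n-1)$. The example below compares the exact average with the mean-field and macroscopic approximations for a given copy-number distribution. The Poisson test distribution, the truncation at n = 100, and the function name are assumptions made purely for illustration.

```python
import math


def dimerization_rates(pmf, lam=1.0):
    """Exact, mean-field and macroscopic averages of r = lam * n * (n - 1).

    pmf[n] is the probability of having n molecules.
    """
    mean = sum(n * p for n, p in enumerate(pmf))
    exact = lam * sum(n * (n - 1) * p for n, p in enumerate(pmf))
    mean_field = lam * (mean ** 2 - mean)   # rate evaluated at the average
    macroscopic = lam * mean ** 2           # additionally assumes n >> 1
    return exact, mean_field, macroscopic


# Hypothetical test distribution: Poisson with mean 5, truncated at n = 100.
m = 5.0
poisson_pmf = [math.exp(-m) * m ** n / math.factorial(n) for n in range(101)]
exact, mean_field, macroscopic = dimerization_rates(poisson_pmf)
print(exact, mean_field, macroscopic)
```

For a Poisson distribution the variance equals the mean, so the exact average happens to coincide with the macroscopic expression while the mean-field value is too low. Other distributions would shift the comparison, which is why the approximations need case-by-case checking.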

Matrix B in the exact equation 8.11 can be similarly approximated using
$\langle B(n)\rangle \approx B(\langle n\rangle)$:


$$\langle B_{ik}\rangle \approx \sum_j v_{ij} v_{kj} r_j(\langle n\rangle) \qquad (8.14)$$


This is a measure of the size and frequency of the random events that introduce
fluctuations in the first place. The FDT thus captures how the overall variability
of the system as measured by $\sigma$ depends on fluctuations introduced in the diffusion
matrix B and the dissipation of fluctuations introduced in the Jacobian matrix A.
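
To make the stationary version of the FDT concrete, the following sketch solves $A\sigma + \sigma A^T + B = 0$ for a two-species system by writing out the three independent elements of the symmetric matrix equation. The example is a hypothetical linear mRNA-protein model (mRNA made at rate 2 and degraded at rate 1 per molecule, protein made at rate 5 per mRNA and degraded at rate 0.1 per molecule), so the averages are 2 and 100. The helper names and the numbers are assumptions, not taken from the text.

```python
def solve3(a, b):
    """Solve a 3x3 linear system a x = b by Gauss-Jordan elimination with pivoting."""
    m = [row[:] + [rhs] for row, rhs in zip(a, b)]
    for col in range(3):
        pivot = max(range(col, 3), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(3):
            if r != col:
                factor = m[r][col] / m[col][col]
                m[r] = [x - factor * y for x, y in zip(m[r], m[col])]
    return [m[i][3] / m[i][i] for i in range(3)]


def stationary_covariance(A, B):
    """Solve A s + s A^T + B = 0 for a symmetric 2x2 covariance matrix s."""
    (a11, a12), (a21, a22) = A
    (b11, b12), (_, b22) = B
    # Unknowns are ordered as (s11, s12, s22).
    coeffs = [
        [2 * a11, 2 * a12, 0.0],   # (1,1) element of the matrix equation
        [a21, a11 + a22, a12],     # (1,2) element
        [0.0, 2 * a21, 2 * a22],   # (2,2) element
    ]
    s11, s12, s22 = solve3(coeffs, [-b11, -b12, -b22])
    return [[s11, s12], [s12, s22]]


# Jacobian of the average dynamics and diffusion matrix at steady state.
A = [[-1.0, 0.0], [5.0, -0.1]]
B = [[2.0 + 2.0, 0.0], [0.0, 10.0 + 10.0]]   # birth flux + death flux per species
sigma = stationary_covariance(A, B)
print(sigma)
```

In this linear example the mRNA variance equals its mean (Poissonian), while the protein variance is well above its mean because the protein also inherits time-averaged mRNA fluctuations.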


8.4.1 Interpreting the FDT in Terms of Physical Observables 


The approximations above may or may not be accurate, but greatly facilitate 
first order approximations. Calculating stationary variances is now technically 
straightforward, and the conceptual challenge lies in interpreting the results in 
terms of general physical principles. To facilitate interpretations, the stationary 
FDT can be reinterpreted in terms of more straightforward physical properties, 
following Paulsson (2004, 2005) and starting with: 


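$$M\eta + \eta M^T = D \qquad (8.15)$$

where:


$$\eta_{ik} = \frac{\sigma_{ik}}{\langle n_i\rangle\langle n_k\rangle}, \quad M_{ik} = -A_{ik}\frac{\langle n_k\rangle}{\langle n_i\rangle} \quad \text{and} \quad D_{ik} = \frac{\langle B_{ik}\rangle}{\langle n_i\rangle\langle n_k\rangle} \qquad (8.16)$$


The normalized Jacobian matrix M can be further rewritten by using the rules
for differentiation of logarithms:


$$\frac{\partial\ln f}{\partial\ln x} = \frac{x}{f}\frac{\partial f}{\partial x} \quad \text{and} \quad \frac{\partial\ln(f/g)}{\partial\ln x} = \frac{\partial\ln f}{\partial\ln x} - \frac{\partial\ln g}{\partial\ln x} \qquad (8.17)$$

Applying these to equation 8.13 and using the steady state condition $\langle J_i^+\rangle = \langle J_i^-\rangle = \langle J_i\rangle$ gives:


$$A_{ik} = \frac{\partial\langle J_i^+\rangle}{\partial\langle n_k\rangle} - \frac{\partial\langle J_i^-\rangle}{\partial\langle n_k\rangle} = \frac{\langle J_i\rangle}{\langle n_k\rangle}\left(\frac{\partial\ln\langle J_i^+\rangle}{\partial\ln\langle n_k\rangle} - \frac{\partial\ln\langle J_i^-\rangle}{\partial\ln\langle n_k\rangle}\right) = -\frac{\langle J_i\rangle}{\langle n_k\rangle}\frac{\partial\ln\left(\langle J_i^-\rangle/\langle J_i^+\rangle\right)}{\partial\ln\langle n_k\rangle} \qquad (8.18)$$


To reduce notation we here use the true averages $\langle J_i^\pm\rangle$ rather than $J_i^\pm(\langle n\rangle)$, but
interpret the matrices in the macroscopic limit where they are interchangeable.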


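Matrix M now becomes


$$M_{ik} = \frac{\langle J_i\rangle}{\langle n_i\rangle}\,\frac{\partial\ln\left(\langle J_i^-\rangle/\langle J_i^+\rangle\right)}{\partial\ln\langle n_k\rangle} \qquad (8.19)$$


At steady state, the average degradation (or synthesis) rate per molecule is approximately
equal to the inverse of the average lifetime $\tau_i$:

$$\frac{\langle J_i^-\rangle}{\langle n_i\rangle} = \frac{\langle J_i^+\rangle}{\langle n_i\rangle} = \frac{\langle J_i\rangle}{\langle n_i\rangle} \approx \frac{1}{\tau_i} \qquad (8.20)$$

This is only exact for exponential first-order decay and approximate for all nonlinear
degradation mechanisms. However, it is not an additional approximation. As shown
above, one version of the stationary FDT approximation evaluates all parameters
at the macroscopic steady state, that is, in the hypothetical limit where each
molecule is immersed in a constant environment of other concentrations. Within
this approximation even nonlinear degradation mechanisms perfectly mimic first-order
exponential decay at steady state. Matrix M thus follows:


$$M_{ik} = \frac{H_{ik}}{\tau_i} \quad \text{where} \quad H_{ik} = \frac{\partial\ln\left(\langle J_i^-\rangle/\langle J_i^+\rangle\right)}{\partial\ln\langle n_k\rangle} \qquad (8.21)$$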





The H parameters are logarithmic susceptibilities or elasticities and measure
how the birth-to-death ratio is affected by concentration changes: If $H_{ik} = 2$, a
1% increase in $n_k$ will approximately cause a 2% increase in the degradation rate
relative to the synthesis rate of $n_i$.

A full reinterpretation of matrix D for arbitrary chemical events will be published
separately. Here we restrict the analysis to nonlinear versions of cases where each
chemical event adds or removes a single molecule of a single species:


$$D_{ii} = \frac{\langle J_i^+\rangle + \langle J_i^-\rangle}{\langle n_i\rangle^2} = \frac{2\langle J_i\rangle}{\langle n_i\rangle^2} \quad \text{and} \quad B_{ik} = D_{ik} = 0 \ \text{for} \ i \neq k \qquad (8.22)$$


Using equation 8.20 we then get


$$D_{ii} = \frac{2}{\langle n_i\rangle\tau_i} \qquad (8.23)$$

Within the approximation, the randomness introduced by the $i$th component is thus
inversely proportional to its average copy number, $\langle n_i\rangle$. That does not mean that
the actual fluctuations in the $i$th component are inversely proportional to $\langle n_i\rangle$: the final effect
of the probabilistic events is filtered through the dynamic responses.


8.4.2 Examples of Elasticities 


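A univariate example illustrates the basic principle:


$$\frac{d\langle n_1\rangle}{dt} = \lambda^+\langle n_1\rangle^\alpha - \lambda^-\langle n_1\rangle^\beta \;\Rightarrow\; H_{11} = \frac{\partial\ln\left(\lambda^-\langle n_1\rangle^\beta/\lambda^+\langle n_1\rangle^\alpha\right)}{\partial\ln\langle n_1\rangle} = \beta - \alpha \qquad (8.24)$$


The univariate elasticity thus equals the difference in effective kinetic order of the
degradation and synthesis fluxes. If the stochastic process is compared to a random
walk in a valley, the elasticity estimates the normalized steepness of the walls. For
an unbiased and unbounded random walk in one variable, $H_{11} = 0$ and the dynamics
is neutrally stable. The multivariate cases are equally simple. Excluding genes from
the gene expression model above (and shifting the indices so that $n_2$ becomes $n_1$
and $n_3$ becomes $n_2$), the mRNA-protein part gives:


$$\begin{aligned} \frac{d\langle n_1\rangle}{dt} &= \lambda_1^+ - \lambda_1^-\langle n_1\rangle \\ \frac{d\langle n_2\rangle}{dt} &= \lambda_2^+\langle n_1\rangle - \lambda_2^-\langle n_2\rangle \end{aligned} \;\Rightarrow\; M = \begin{pmatrix} 1/\tau_1 & 0 \\ -1/\tau_2 & 1/\tau_2 \end{pmatrix} \qquad (8.25)$$


where $\tau_i = 1/\lambda_i^-$. Both the uni- and multivariate examples above are particularly
simple because both synthesis and degradation follow power-laws, but many more
complicated mechanisms are also easy to evaluate using the definitions in equations
8.18-8.21 or simply eyeballing lower and upper bounds.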
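
For mechanisms that are awkward to differentiate by hand, an elasticity can also be estimated numerically. The sketch below assumes the definition $H = \partial\ln(J^-/J^+)/\partial\ln n$ and uses a central difference in $\ln n$. The function name, the step size, and the two example mechanisms are hypothetical.

```python
import math


def elasticity(death_flux, birth_flux, n, rel_step=1e-6):
    """Central-difference estimate of d ln(J- / J+) / d ln n at copy number n."""
    def log_ratio(x):
        return math.log(death_flux(x) / birth_flux(x))

    up, down = n * (1 + rel_step), n * (1 - rel_step)
    return (log_ratio(up) - log_ratio(down)) / (math.log(up) - math.log(down))


# Power-law fluxes: degradation of order 2, synthesis of order 0.5.
h_power = elasticity(lambda n: 0.3 * n ** 2, lambda n: 4.0 * n ** 0.5, n=20.0)

# Saturating (Michaelis-Menten-like) degradation with constant synthesis.
h_saturating = elasticity(lambda n: 10.0 * n / (n + 5.0), lambda n: 8.0, n=20.0)

print(h_power, h_saturating)
```

The power-law case reproduces the difference in kinetic orders, and the saturating case gives an elasticity below one because the degradation flux responds only weakly to n once the enzyme is mostly saturated.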


8.4.3 A Generalized Pseudo-Bivariate Example 


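Rewriting the FDT in terms of physical observables greatly facilitates interpretation
and makes it possible to collectively address families of dynamic processes without
losing interpretability. For example, consider the following extension of the mRNA-protein
model:


$$\begin{aligned} n_1 &\xrightarrow{J_1^+(n_1)} n_1 + 1 \\ n_1 &\xrightarrow{J_1^-(n_1)} n_1 - 1 \\ n_2 &\xrightarrow{J_2^+(n_1,n_2)} n_2 + 1 \\ n_2 &\xrightarrow{J_2^-(n_1,n_2)} n_2 - 1 \end{aligned} \qquad \begin{aligned} \frac{d\langle n_1\rangle}{dt} &= \langle J_1^+(n_1)\rangle - \langle J_1^-(n_1)\rangle \\ \frac{d\langle n_2\rangle}{dt} &= \langle J_2^+(n_1,n_2)\rangle - \langle J_2^-(n_1,n_2)\rangle \end{aligned} \qquad (8.26)$$


Because the second variable does not affect the first, this is a pseudo-bivariate
stochastic system. According to the FDT approach above any process with a stable
fixed point then has:


$$M = \begin{pmatrix} H_{11}/\tau_1 & 0 \\ H_{21}/\tau_2 & H_{22}/\tau_2 \end{pmatrix} \quad \text{and} \quad D = \begin{pmatrix} 2/(\langle n_1\rangle\tau_1) & 0 \\ 0 & 2/(\langle n_2\rangle\tau_2) \end{pmatrix} \qquad (8.27)$$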

Solving equation 8.15 gives


$$\frac{\sigma_2^2}{\langle n_2\rangle^2} = \overbrace{\underbrace{\frac{1}{\langle n_2\rangle}}_{\text{low-copy fluctuations}} \times \underbrace{\frac{1}{H_{22}}}_{\text{effective stability}}}^{\text{spontaneous or intrinsic } x_2 \text{ noise}} + \overbrace{\underbrace{\frac{\sigma_1^2}{\langle n_1\rangle^2}}_{\text{environmental fluctuations}} \times \underbrace{\frac{H_{21}^2}{H_{22}^2}}_{\text{static susceptibility}} \times \underbrace{\frac{H_{22}/\tau_2}{H_{22}/\tau_2 + H_{11}/\tau_1}}_{\text{one-step time-averaging}}}^{\text{forced or extrinsic } x_2 \text{ noise}} \qquad (8.28)$$


where $\eta_{11} = \sigma_1^2/\langle n_1\rangle^2 \approx \langle n_1\rangle^{-1} H_{11}^{-1}$. The intrinsic noise term comes from the
spontaneous randomness of $x_2$ itself, introduced by element $D_{22}$. Its first factor
represents population smallness: each birth and death event has a larger relative
effect in a smaller population. The second factor of the first term represents the
dynamic response to perturbations and can be interpreted in several different ways.
Normalized deviations $\Delta x_2$ from steady state in the corresponding deterministic
system follow


$$\frac{d\Delta x_2}{dt} = -\frac{H_{22}}{\tau_2}\,\Delta x_2$$


Parameter $H_{22}/\tau_2$ is thus the adjustment rate constant to steady state following a
perturbation. The rate can be changed in two ways: by changing the nonlinearity
as measured by $H_{22}$ or by changing the average lifetime $\tau_2$. However, as seen in
equation 8.27, the latter would also affect the rate of spontaneous randomization

are the normalized rate constants for adjustments to steady state in concentrations
$x_1$ and $x_2$ respectively ($-H_{11}/\tau_1$ and $-H_{22}/\tau_2$ are the eigenvalues of M). The ratio


$$\frac{H_{22}/\tau_2}{H_{11}/\tau_1} \qquad (8.34)$$


is thus a measure for how rapidly $x_2$ changes relative to $x_1$. Comparing with
equation 8.28 illustrates a qualitative difference between spontaneous and forced 
fluctuations. If the fluctuations come from the randomness of the environment, rapid 
adjustments simply make the system more responsive to underlying fluctuations. 
In other words: The current state of a more slowly adjusting system depends on a 
longer history of ups and downs in the environment that then partially cancel out. 
Increasing parameter $H_{22}$ thus has several effects: increasing the tendency to return
to a preferred average, reducing the susceptibility to permanent changes in
the environment, and increasing the temporal responsiveness. The susceptibility
decreases quadratically with $H_{22}$ while the temporal responsiveness at most increases
in proportion to $H_{22}$, so the net effect should be lower noise. This is not
fully general though. In feedback systems with lags or delays, the temporal response
factor can increase more than quadratically in $H_{22}$, so that a higher $H_{22}$ can increase
total noise and cause oscillatory responses to perturbations (for experimental
observations of oscillations in feedback systems, see Lahav et al., 2004).
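
The trade-off between susceptibility and responsiveness can be explored numerically. The sketch below assumes that equation 8.28 reads $\sigma_2^2/\langle n_2\rangle^2 = 1/(\langle n_2\rangle H_{22}) + \eta_{11}\,(H_{21}/H_{22})^2\,(H_{22}/\tau_2)/(H_{22}/\tau_2 + H_{11}/\tau_1)$ with $\eta_{11} = 1/(\langle n_1\rangle H_{11})$. The function name and parameter values are hypothetical.

```python
def pseudo_bivariate_noise(n1, n2, tau1, tau2, h11=1.0, h21=-1.0, h22=1.0):
    """Return (intrinsic, extrinsic) parts of sigma_2^2 / <n_2>^2."""
    intrinsic = 1.0 / (n2 * h22)                 # low-copy term / effective stability
    environmental = 1.0 / (n1 * h11)             # eta_11, noise of the upstream species
    susceptibility = (h21 / h22) ** 2            # static response to changes in n1
    averaging = (h22 / tau2) / (h22 / tau2 + h11 / tau1)   # one-step time-averaging
    return intrinsic, environmental * susceptibility * averaging


# Hypothetical mRNA-protein numbers: <n1> = 2, <n2> = 100, tau1 = 1, tau2 = 10.
base = pseudo_bivariate_noise(2.0, 100.0, 1.0, 10.0)
# Same system with stronger self-correction of the second species (H22 = 3).
corrected = pseudo_bivariate_noise(2.0, 100.0, 1.0, 10.0, h22=3.0)

print(sum(base), sum(corrected))
```

With these numbers the higher $H_{22}$ lowers both terms: the susceptibility falls quadratically while the time-averaging factor rises less than proportionally, in line with the argument above for systems without lags or delays.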





8.5 Fluctuations near Critical Points 


From the FDT analysis above we saw that the size of the stationary fluctuations
depends on two opposing forces: the turnover of molecules in random events that
contributes to diffusion in state space and the rate of relaxation back towards the
average. These principles are illustrated in figure 8.1. At thermodynamic equilibrium
these forces are fundamentally coupled such that for each substance the variance in 
the number of molecules is smaller than or equal to the average (as in a binomial 
distribution). However, in systems away from equilibrium the flux can be large 
although the rate of relaxation to steady state is small and vice versa. That means 
that relative fluctuations can be arbitrarily large even if the number of molecules 
is high, or arbitrarily small even if the number of molecules is low. 

For univariate linearized systems where one molecule is synthesized or consumed
per reaction, the stationary FDT boils down to

$$\sigma^2 = -\frac{B}{2A} = -\frac{J^+ + J^-}{2A} \qquad (8.35)$$

where $J^\pm$ are the total stationary fluxes of either production or elimination and $-A$
is the rate of relaxation back to steady state.

[Figure 8.1: four panels (high flow and slow relaxation; low flow and slow relaxation; high flow and rapid relaxation; low flow and rapid relaxation), each showing the fluxes $J^\pm(n)$ and the distribution $p(n)$, with $d\langle n\rangle/dt = \langle J^+(n) - J^-(n)\rangle$.]

Figure 8.1 Flow, relaxation, and fluctuations. The figure illustrates in a univariate
example the four different combinations of high/low flux and fast/slow relaxation to steady
state. Near-critical fluctuations arise when the rate of relaxation is low at the same time
as the flow is relatively high.


In normalized variables the same relation takes the form


$$\frac{\sigma^2}{\langle n\rangle^2} = \frac{1}{\langle n\rangle}\,\frac{1}{H} \qquad (8.36)$$


Poissonian-sized fluctuations with $\sigma^2 = \langle n\rangle$ are thus only obtained when the turnover
of the pool $J/\langle n\rangle$ is equal to the rate of relaxation, $-A$, or, equivalently, when the
elasticity $H = 1$ in equation 8.36. Here we consider some simple kinetic systems
that operate near dynamically unstable points ($H \approx 0$ in univariate systems) and
therefore display large fluctuations.


8.5.1 Autocatalysis 


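Consider the autocatalytic system with three reactions

$$n \xrightarrow{\lambda_1} n+1 \qquad n \xrightarrow{\lambda_2 n} n+1 \qquad n \xrightarrow{\mu n} n-1 \qquad (8.37)$$

Because the rates depend linearly on state, the time evolution of the average value
is given by the exact


$$\frac{d\langle n\rangle}{dt} = \lambda_1 + (\lambda_2 - \mu)\langle n\rangle \qquad (8.38)$$

For $\mu > \lambda_2$ the system has a stationary state with average $\langle n\rangle = \lambda_1/(\mu - \lambda_2)$. The
rate of relaxation to steady state is $\mu - \lambda_2$ and the average flow through the system
is $\mu\langle n\rangle = \mu\lambda_1/(\mu - \lambda_2)$. The steady state elasticity is


$$H = \frac{\partial\ln\left(\mu\langle n\rangle/(\lambda_1 + \lambda_2\langle n\rangle)\right)}{\partial\ln\langle n\rangle} = \frac{\lambda_1}{\lambda_1 + \lambda_2\langle n\rangle} = \frac{\mu - \lambda_2}{\mu} \qquad (8.39)$$

The FDT in turn gives

$$\frac{\sigma^2}{\langle n\rangle^2} = \frac{1}{\langle n\rangle}\,\frac{1}{H} = \frac{1}{\langle n\rangle}\,\frac{\mu}{\mu - \lambda_2} \qquad (8.40)$$


This expression is exact, as can be shown by calculating the full stationary distribution
(Gardiner, 1985) or by noting that the reaction rates are linear in n.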
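
The exactness claim can be checked numerically. For a one-step process the stationary distribution follows from balancing the probability flux between neighbouring states, $p(n+1)\,\mu(n+1) = p(n)\,(\lambda_1 + \lambda_2 n)$. The sketch below builds the distribution this way and compares its moments with the FDT result, assumed here to be $\sigma^2/\langle n\rangle = \mu/(\mu - \lambda_2)$. The rate values, the truncation at 2000 molecules, and the function name are hypothetical.

```python
def autocatalysis_moments(lam1, lam2, mu, n_max=2000):
    """Mean and variance of the stationary distribution of the autocatalytic scheme."""
    weights = [1.0]
    for n in range(n_max):
        # Flux balance between states n and n + 1 (unnormalized probabilities).
        weights.append(weights[-1] * (lam1 + lam2 * n) / (mu * (n + 1)))
    total = sum(weights)
    mean = sum(n * w for n, w in enumerate(weights)) / total
    var = sum((n - mean) ** 2 * w for n, w in enumerate(weights)) / total
    return mean, var


lam1, lam2, mu = 2.0, 0.9, 1.0      # hypothetical rates with mu > lam2
mean, var = autocatalysis_moments(lam1, lam2, mu)

fdt_mean = lam1 / (mu - lam2)        # stationary average from equation 8.38
fdt_var = fdt_mean * mu / (mu - lam2)
print(mean, var, fdt_mean, fdt_var)
```

Because $\lambda_2$ is close to $\mu$ in this example, the elasticity is small (H = 0.1) and the variance is ten times the mean.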


8.5.2 Covalent Modification-Demodification Cycles 


Assume that a substrate is converted from unmodified state $X_1$ to modified state $X_2$
by one enzyme, and back again by another enzyme. With a constant total number
$n_{max}$ of modified and unmodified molecules, the state of the system is described by
the number n of $X_1$ molecules. The reaction scheme is

$$X_1 \underset{J^+(n_{max}-n)}{\overset{J^-(n)}{\rightleftharpoons}} X_2 \qquad (8.41)$$

Here we assume Michaelis-Menten type reactions, where $J^-(n) = k_1 n/(n + K)$ and
$J^+(n_{max} - n) = k_2(n_{max} - n)/((n_{max} - n) + K)$. The Michaelis-Menten approximation
relies on condensations of several elemental reactions, and was originally
derived for macroscopically large systems. The stochastic behavior (Bartholomay,
1962) can be different, so this should only be considered a first approximation.
However, it still accounts for the major dynamic effect of first order degradation at
low n and saturation at high n.

The modification-demodification cycles in equation 8.41 can display so-called
zero-order ultrasensitivity if $n_{max} \gg K$, such that both enzymes can be saturated
for $k_1 = k_2 = k$ (Goldbeter and Koshland, 1981). Ultrasensitivity refers to the
fact that a small fractional change in the rate constant k for modification rates
makes a large difference in the fractional level of modification. A slight increase in
an enzyme level can thus push almost all molecules into one state or the other. As
a first approximation, the FDT method gives $\sigma^2/\langle n\rangle = H^{-1} = d\ln\langle n\rangle/d\ln k$, that
is, the ultra-sensitive response to changes in k is again tightly connected to large
random fluctuations. For more details and exact theoretical expressions see Berg
et al. (2000a). Large fluctuations in modification-demodification cycles were also
recently experimentally observed by Korobkova et al. (2004).
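
The connection between saturation and large fluctuations can be sketched as follows. The code finds the macroscopic steady state of the cycle by bisection, evaluates the elasticity $H = \partial\ln(J^-/J^+)/\partial\ln n$ for the Michaelis-Menten fluxes above, and returns the first-approximation FDT estimate $\sigma^2/\langle n\rangle = 1/H$. The function name and parameter values are hypothetical, and the analytical derivative inside the function is worked out here rather than quoted from the text.

```python
def cycle_fano(n_max, K, k1=1.0, k2=1.0):
    """Macroscopic steady state n and FDT estimate sigma^2/<n> = 1/H for the cycle."""
    def net(n):
        # d<n>/dt = J+(n_max - n) - J-(n), with n the number of X1 molecules
        m = n_max - n
        return k2 * m / (m + K) - k1 * n / (n + K)

    lo, hi = 0.0, float(n_max)
    for _ in range(200):
        # Bisection works because net(0) > 0 > net(n_max) and net decreases with n.
        mid = 0.5 * (lo + hi)
        lo, hi = (mid, hi) if net(mid) > 0 else (lo, mid)
    n = 0.5 * (lo + hi)
    m = n_max - n
    # d ln J- / d ln n = K / (n + K);  d ln J+ / d ln n = -n K / (m (m + K))
    H = K / (n + K) + n * K / (m * (m + K))
    return n, 1.0 / H


n_sat, fano_sat = cycle_fano(1000, 10.0)      # enzymes close to saturation
n_lin, fano_lin = cycle_fano(1000, 1000.0)    # enzymes far from saturation
print(n_sat, fano_sat, n_lin, fano_lin)
```

With equal rate constants the steady state sits at half modification in both cases, but the near-saturated cycle has a much smaller elasticity and hence a much larger estimated variance relative to the mean.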
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8.5.3 Multisubstrate Reactions 


Consider the anabolic reaction scheme with two species and the following transitions:


$$\begin{aligned} n_1 &\xrightarrow{k} n_1 + 1 & n_2 &\xrightarrow{k} n_2 + 1 \\ n_1 &\xrightarrow{\mu n_1} n_1 - 1 & n_2 &\xrightarrow{\mu n_2} n_2 - 1 \\ [n_1, n_2] &\xrightarrow{k_2 n_1 n_2} [n_1 - 1, n_2 - 1] \end{aligned} \qquad (8.42)$$


It is possible to treat more general cases analytically (Elf et al., 2003), but here we
only consider situations where the consumption flux of the bimolecular reaction is
much higher than the first order consumption events, that is $\mu n_1 \approx \mu n_2 \ll k_2 n_1 n_2$.
The averages follow


$$\begin{aligned} d\langle n_1\rangle/dt &= k - k_2\langle n_1 n_2\rangle - \mu\langle n_1\rangle \\ d\langle n_2\rangle/dt &= k - k_2\langle n_1 n_2\rangle - \mu\langle n_2\rangle \end{aligned} \qquad (8.43)$$


where $\langle n_1 n_2\rangle \approx \langle n_1\rangle\langle n_2\rangle$ in the mean-field or macroscopic approximation. When
$\mu > 0$ this system has one attracting steady state where $\langle n_1\rangle = \langle n_2\rangle =
\sqrt{k/k_2 + \mu^2/4k_2^2} - \mu/2k_2 \approx \sqrt{k/k_2}$. For the linearized dynamics around steady
state, the Jacobian matrix has a slow eigenvalue $-\mu$ which determines the rate of
relaxation to the steady state. In the limit $\mu \to 0$ the single attracting steady
state bifurcates into a curve of steady state points satisfying $k_2 n_1 n_2 = k$. Thus in
the limit $\mu \to 0$ the fluctuations become macroscopically large, as the system can
diffuse freely on the curve of stationary states (Elf and Ehrenberg, 2003). When
$0 < \mu \ll \sqrt{k k_2}$ the proximity to the critical point makes the fluctuations large in
any system of finite size.

The FDT approximation can also be applied to the system, but in this case it is
advantageous to first make a linear transformation of the variables, such that the
new stochastic variables correspond to fluctuations in the two perpendicular eigen-directions
of the linearized system. In the slow eigen-direction, [1 -1], corresponding
to the eigenvalue $-\mu$, fluctuations are large and slow compared to the fast and small
fluctuations in the perpendicular direction [1 1]. By combined application of FDT
on the two separated timescales an accurate analytical solution can be obtained (Elf
and Ehrenberg, 2003). Here we will only focus on the large and slow fluctuations
in the variable $w = n_1 - n_2$. In this variable the state transitions and their rates are


$$w \xrightarrow{k} w + 1 \qquad w \xrightarrow{k} w - 1 \qquad w \xrightarrow{\mu n_1} w - 1 \qquad w \xrightarrow{\mu n_2} w + 1 \qquad (8.44)$$


The macroscopic rate equation for w is $d\langle w\rangle/dt = k - k + \mu\langle n_2\rangle - \mu\langle n_1\rangle = -\mu\langle w\rangle$.
That is, deviations in the difference decay exponentially with rate $-A_w = \mu$. In
steady state $\langle n_2\rangle = \langle n_1\rangle = \langle n\rangle$ and the fluxes are $J^+ = J^- = J = k + \mu\langle n\rangle$. The
FDT approximation for the variance in w then gives

$$\sigma_w^2 = \frac{k + k + \mu\langle n_1\rangle + \mu\langle n_2\rangle}{2\mu} = \frac{k}{\mu} + \langle n\rangle \qquad (8.45)$$


The fluctuations in the original variables $n_1$ and $n_2$ are $\sigma_i^2 \geq \sigma_w^2/4$, where equality
holds if the fluctuations in $n_1$ and $n_2$ are perfectly anti-correlated. Using $\langle n\rangle^2 \approx k/k_2$, the relative
fluctuations in the individual pools are thus $\sigma_i^2/\langle n\rangle^2 \geq (k/\mu + \langle n\rangle)/(4\langle n\rangle^2) \approx k_2/4\mu$. This result indicates
very large fluctuations as shown in figure 8.2. The variance in w can also be
calculated exactly (Elf, 2004) using moment generating functions with the result
that $\sigma_w^2 = k/\mu + \langle n\rangle$. The FDT result equation 8.45 is therefore exact because we
could change variables so that the rates are linear in the new variables.

Figure 8.2 Near-critical fluctuations. The system is simulated using the Gillespie-Doob
algorithm. The large fluctuations in $n_1$ and $n_2$ are anti-correlated such that $k_2 n_1 n_2 \approx k$.
Parameters: $k = 600\,\mathrm{s}^{-1}$, $k_2 = 0.001\,\mathrm{s}^{-1}$, and $\mu = 2 \cdot 10^{-4}\,\mathrm{s}^{-1}$.


The large fluctuations can in this case be explained by a multivariate version 
of the argument behind the zero-order behavior above: an increase in nı is com- 
pensated by a decrease in n2 such that the total flow is unchanged. The same 
phenomenon is predicted to occur in the pools of aminoacylated tRNA used as sub- 
strates in protein synthesis. Here many different concentration combinations give 
the same total rate of protein synthesis, such that the fluctuations in the individual
ternary complex pools can be very large (Elf and Ehrenberg, 2005a).
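
The size of the near-critical fluctuations in the multisubstrate scheme can be estimated directly from the expressions above. The sketch below assumes the macroscopic steady state $\langle n\rangle = \sqrt{k/k_2 + \mu^2/4k_2^2} - \mu/2k_2$ and the FDT variance $\sigma_w^2 = k/\mu + \langle n\rangle$ for $w = n_1 - n_2$, and uses the parameter values quoted in the caption of figure 8.2. The function name is hypothetical.

```python
import math


def near_critical_summary(k, k2, mu):
    """Macroscopic mean, FDT variance of w = n1 - n2, and a relative-variance bound."""
    mean = math.sqrt(k / k2 + mu ** 2 / (4 * k2 ** 2)) - mu / (2 * k2)
    var_w = k / mu + mean
    # Lower bound on sigma_i^2 / <n>^2 for each pool; equality would require
    # perfectly anti-correlated fluctuations in n1 and n2.
    rel_var_bound = var_w / (4 * mean ** 2)
    return mean, var_w, rel_var_bound


mean, var_w, bound = near_critical_summary(k=600.0, k2=0.001, mu=2e-4)
print(mean, var_w, bound)
```

For these parameters the bound on the relative variance is of order one, so the standard deviation of each pool is comparable to its average even though the pools contain hundreds of molecules.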





8.6 Negative Feedback of Replication Control 


In some systems—including replication control, cell division, and central metabolic 
pathways—fluctuations pose a threat that cells must carefully eliminate. The most 
studied mechanism for noise suppression is perhaps negative feedback control,
where a random fluctuation to a higher concentration leads to lower synthesis rate
and forces the system back towards the average. Negative feedback is particularly 
effective in systems that otherwise would display near-critical dynamics (Elf et al., 
2003; Paulsson and Ehrenberg, 2001), but its effects can also be significant in other 
types of systems (Becskei and Serrano, 2000; Lahav et al., 2004; Swain, 2004). 
Here we will focus on plasmid replication control as a model system but keep the 
arguments general. 
Plasmids are unrivaled model systems for noise suppression for several reasons: 


1. The average plasmid loss rate—the risk that a plasmid-containing mother cell 
gives rise to a plasmid-free daughter cell—increases drastically with random fluc- 
tuations. 


2. Plasmids self-replicate and would thus generate enormous fluctuations without 
negative regulation, causing both high plasmid losses and slowed growth. 


3. Most plasmid species have copy numbers of about 1-100 per cell, such low 
numbers that spontaneous fluctuations could be substantial. 


4. Numerous plasmid replication control systems only include two or three gene 
products and have been as well characterized as λ phage. They also tend to be
more independent of background processes than almost any other cellular network. 


For many plasmid species, an increase in the number of plasmid copies ($n_1$)
increases the average synthesis flux ($J_2^+$) of a replication inhibitor ($n_2$) and thereby
decreases the plasmid replication flux ($J_1^+$). The average dynamics can often be
modeled by:

$$\frac{d\langle n_1\rangle}{dt} = \langle J_1^+(n_1, n_2)\rangle - \langle J_1^-(n_1)\rangle \quad \text{and} \quad \frac{d\langle n_2\rangle}{dt} = \langle J_2^+(n_1, n_2)\rangle - \langle J_2^-(n_1, n_2)\rangle \qquad (8.46)$$

Both $J_1^+$ and $J_1^-$ are assumed proportional to $n_1$ because every plasmid copy can
self-replicate (one molecule at a time) and because plasmid segregation in growing
populations can be qualitatively approximated by first-order degradation (again
eliminating one molecule at a time). Further assuming that $J_1^+$ and $J_2^+$ monotonically
decrease and increase with $n_2$ and $n_1$ respectively, the equations have
a stable steady state (Kurosawa et al., 2002) around which we can approximate
stationary variances using the fluctuation-dissipation theorem. To reduce the algebraic
complexity we will here assume that the inhibitors are present in such high
numbers that they do not contribute their own plasmid-independent fluctuations
through noisy signaling ($D_{22} = 0$). This is not always a reasonable approximation
and it will certainly hide some interesting principles. However, a full analysis will
be published elsewhere, showing the effect of both noisy signaling and fluctuations
in environmental variables.

Without any further assumptions or restrictions, the reinterpretation of the FDT
in section 8.4 then leads to


$$\frac{\sigma_1^2}{\langle n_1\rangle^2} = \frac{1}{\langle n_1\rangle} \times \left(\frac{H_{22}}{|H_{12}H_{21}|} + \frac{\tau_2}{\tau_1}\frac{1}{H_{22}}\right) = \frac{1}{\langle n_1\rangle} \times \underbrace{\frac{1}{G}}_{\text{correcting fluctuations}} \times \underbrace{\frac{T_1 + T_2}{T_1}}_{\text{inverse effect of time-averaging}} \qquad (8.47)$$


where $\tau_1$ is the generation time of the host cell and $\tau_2$ is the average lifetime of
the inhibitor (including dilution effects). The compounded parameters are defined
by $G = -H_{12}H_{21}/H_{22}$, $T_1 = \tau_1/G$, and $T_2 = \tau_2/H_{22}$. A high negative $H_{21}$ means
that a relative increase in plasmid concentration gives a high relative increase in the
inhibitor synthesis rate. However, plasmids affect inhibitors by encoding their genes,
so there are few exceptions to $J_2^+ = k n_1 f(n_2)$ for which $H_{21} = -1$. This is not only
true for constitutively expressed genes: if inhibitors would feed back on their own
synthesis, that would instead affect $H_{22}$. A low $H_{22}$ means that the ratio $J_2^-/J_2^+$
is insensitive to changes in $n_2$, so that the steady state average of $n_2$ conversely
is sensitive to $J_2^-/J_2^+$. If inhibitors were made at constant rates per plasmid
copy and decayed exponentially, corresponding to $J_2^+ = k n_1$ and $J_2^- = n_2/\tau_2$,
then $H_{22} = 1$. If inhibitors instead were degraded by enzymes that operate close
to saturation, $H_{22}$ could be arbitrarily close to zero (zero-order ultrasensitivity
(Elf et al., 2003)), and if inhibitors autorepressed their own synthesis, $H_{22}$ could
be arbitrarily increased. A high $H_{12}$ means that an increase in the inhibitor
concentration sharply turns off replication. Plasmids use numerous strategies to
increase $H_{12}$, including multistep proofreading control, inhibitor multimerization,
and cooperative binding of inhibitors where $H_{12}$ is the Hill coefficient far from
saturation.

The compounded parameter $G = -H_{12}H_{21}/H_{22}$ is the total sensitivity gain over
the feedback loop. If G = 3, then a 1% change in the plasmid concentration would
eventually lead to a 3% change in the plasmid birth-to-death balance. The time
constants $T_1 = \tau_1/G$ and $T_2 = \tau_2/H_{22}$ determine how rapidly the plasmid changes
and how rapidly the inhibitor adjusts to the plasmid.
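
The interplay between loop gain and inhibitor response time can be illustrated numerically. The sketch below assumes that equation 8.47 has the form $\sigma_1^2/\langle n_1\rangle^2 = \langle n_1\rangle^{-1}\,G^{-1}\,(T_1 + T_2)/T_1$ with $G = -H_{12}H_{21}/H_{22}$, $T_1 = \tau_1/G$ and $T_2 = \tau_2/H_{22}$. The second helper follows from minimizing that expression with respect to $H_{22}$ and is an illustration rather than a result stated in the text. All names and parameter values are hypothetical.

```python
import math


def plasmid_noise(n1, tau1, tau2, h12, h21=-1.0, h22=1.0):
    """sigma_1^2 / <n_1>^2 for the feedback loop (needs h12 * h21 < 0 and h22 > 0)."""
    gain = -h12 * h21 / h22              # total sensitivity gain G over the loop
    t1, t2 = tau1 / gain, tau2 / h22     # plasmid and inhibitor adjustment times
    return (1.0 / n1) * (1.0 / gain) * (t1 + t2) / t1


def best_h22(tau1, tau2, h12, h21=-1.0):
    """H22 minimizing H22/|H12 H21| + tau2/(tau1 H22) for fixed other parameters."""
    return math.sqrt(abs(h12 * h21) * tau2 / tau1)


# Hypothetical plasmid: 10 copies, inhibitor lifetime a quarter of a generation.
optimum = best_h22(tau1=1.0, tau2=0.25, h12=4.0)
noise_at_optimum = plasmid_noise(10.0, 1.0, 0.25, h12=4.0, h22=optimum)
noise_low_h22 = plasmid_noise(10.0, 1.0, 0.25, h12=4.0, h22=0.5)
noise_high_h22 = plasmid_noise(10.0, 1.0, 0.25, h12=4.0, h22=2.0)
print(optimum, noise_at_optimum, noise_low_h22, noise_high_h22)
```

Under these assumptions the noise is lowest at an intermediate $H_{22}$: a lower value raises the gain but slows the inhibitor response, while a higher value speeds the response at the cost of gain.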

This is best illustrated by some examples. Several plasmids have been described
by

$$\frac{d\langle n_1\rangle}{dt} = \lambda_1\langle n_1\rangle\left(\frac{K}{K+\langle n_2\rangle}\right)^{h} - \mu_1\langle n_1\rangle \quad\text{and}\quad \frac{d\langle n_2\rangle}{dt} = \lambda_2\langle n_1\rangle - \mu_2\langle n_2\rangle \qquad (8.48)$$

In the first equation, $\lambda_1$ is the frequency with which each plasmid copy attempts
to initiate replication, $h$ is the Hill coefficient of inhibition, and $\mu_1$ is the dilution
rate due to cell growth. In the second equation, $\lambda_2$ is the per plasmid inhibitor
synthesis rate, and $\mu_2$ is the sum of the dilution rate due to cell growth and the
inherent inhibitor degradation rate. The parameters in equation 8.48 relate to the
parameters in equation 8.47 as:

$$\tau_1 = \frac{1}{\mu_1},\quad \tau_2 = \frac{1}{\mu_2},\quad H_{22} = -H_{21} = 1,\quad\text{and}\quad H_{12} = h\,\frac{\langle n_2\rangle}{K+\langle n_2\rangle} = h\left[1-\left(\frac{\mu_1}{\lambda_1}\right)^{1/h}\right] \qquad (8.49)$$
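
The relation for $H_{12}$ can be checked numerically. The sketch below assumes that the inhibition term in equation 8.48 has the form $(K/(K+\langle n_2\rangle))^h$ and that $H_{12}$ is the logarithmic derivative of the plasmid death-to-birth ratio with respect to $n_2$. The function names and parameter values are hypothetical.

```python
import math

def death_to_birth_ratio(n2, lam1, mu1, K, h):
    """Per-copy dilution rate divided by the inhibited replication rate."""
    return mu1 / (lam1 * (K / (K + n2)) ** h)

def log_gain(f, x, eps=1e-6):
    """Numerical d ln f / d ln x using a centered difference in ln x."""
    up, down = x * math.exp(eps), x * math.exp(-eps)
    return (math.log(f(up)) - math.log(f(down))) / (2 * eps)

lam1, mu1, K, h = 8.0, 1.0, 10.0, 2.0        # hypothetical values
# Steady state of the plasmid equation: (K / (K + n2))**h = mu1 / lam1
n2_ss = K * ((lam1 / mu1) ** (1 / h) - 1)

h12_numeric = log_gain(lambda n2: death_to_birth_ratio(n2, lam1, mu1, K, h), n2_ss)
h12_formula = h * n2_ss / (K + n2_ss)
h12_rates = h * (1 - (mu1 / lam1) ** (1 / h))
print(h12_numeric, h12_formula, h12_rates)
```

All three values agree and stay below $h$, which is the limitation on the loop gain that stochastic focusing is later shown to circumvent.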


If the inhibitors instead were degraded by a Michaelis-Menten mechanism, as 


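$$\frac{d\langle n_2\rangle}{dt} = \lambda_2\langle n_1\rangle - \mu_2\langle n_2\rangle - \frac{v_{\max}\langle n_2\rangle}{K_M+\langle n_2\rangle} \qquad (8.50)$$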
then the effective H22 can be substantially lower. If the enzymatic degradation term 
dominates over the first-order term, and the enzymes operate close to saturation, the 
inhibitor displays zero-order ultrasensitivity (see section 8.5) (Elf et al., 2003) and
$H_{22}$ approaches zero. This could reduce plasmid fluctuations greatly: a small relative
increase in the plasmid concentration will produce an enormous relative increase 
in the quasi-steady state of the inhibitor, which leads to an enormous relative 
decrease in the plasmid replication frequency. However, as seen in equation 8.47,
increasing $1/H_{22}$ too much will eventually increase fluctuations. This is because
a higher $1/H_{22}$ will also slow down the dynamic response of the inhibitor, i.e., it
will take more inhibitor lifetimes $\tau_2$ before the inhibitor adjusts to its new
quasi-steady state after a change in plasmid concentration. Parameter $G$ only represents
the effective gain over the loop if the inhibitor response is fast. If the inhibitor 
has a finite response time, it will instead lag behind and depend on the history 
of plasmid concentrations. The current inhibitor value is thus an effective average 
over a history of plasmid concentrations—just like protein fluctuations average over 
gene or mRNA fluctuations in equation 8.5—and therefore tends to display smaller 
relative deviations from steady state. In other words, the inhibitor underestimates 
the deviation from steady state and the response is weaker than it otherwise would 
have been. The inhibitor time-averaging thus increases plasmid fluctuations. This 
can be thought of in terms of corrections: tighter feedback loops (higher G) correct 
spontaneous fluctuations more rapidly, but if inhibitors lag behind the corrections 
are slowed on average. 

The analysis above illustrates a few principles that are common to negative 
feedback systems—the importance of zero-order effects, sensitivity amplification 
and time-lags. However, numerous other principles are not accounted for in the 
analysis above. The fact that inhibitors are made and degraded by inherently 
stochastic mechanisms will produce a signaling noise that can enslave the plasmid. 
In many systems, negative feedback can thereby increase fluctuations by introducing 
its own randomness. For example, in gene expression autorepression of transcription 
may increase the tendency to correct fluctuations, but would also introduce random 
association and dissociation of the repressor to DNA, which can have enormous 
randomizing effects (Paulsson, 2004; Tomioka et al., 2004).


8.7 Noise Induced Transitions


8.7.1 Stochastic Focusing and Noise-Suppression-by-Noise


Plasmids with the replication control systems described in section 8.6 by equa- 
tion 8.48 seem to be limited by G < h, that is, the total gain over the feedback 
loop cannot be higher than the Hill coefficient of inhibition. Some of these plasmids 
have further been suggested to use so-called hyperbolic mechanisms (Hill coeffi- 
cient h = 1) and thus seem doomed to inefficient noise suppression. However, the 
FDT approach is a so-called mean-field approximation and depends on linearized 
responses. It can be qualitatively misleading in some nonlinear systems. 

When inhibitors fluctuate rapidly ($T_2 \ll T_1$ in equation 8.47), the bivariate
master equation behind equation 8.48 can be replaced by a univariate master
equation for $n_1$ and a conditional master equation for the probability $P(n_2|n_1)$
of $n_2$ given $n_1$. The simplest inhibitor dynamics that give rise to equation 8.48
are Poissonian synthesis with rate $\lambda_2 n_1$ combined with exponential decay with
rate $\mu_2 n_2$. This would generate a Poissonian conditional probability $P(n_2|n_1)$ with
conditional average $\langle n_2\rangle = \lambda_2 n_1/\mu_2$. If we assume that the probability that a
replication attempt is successful is given by the hyperbolic function

$$q(n_2) = \frac{K}{K+n_2} \qquad (8.51)$$

then the true average $q$ is

$$\langle q|n_1\rangle = \sum_{n_2=0}^{\infty} q(n_2)\,P(n_2|n_1) \neq q(\langle n_2\rangle) \qquad (8.52)$$

This reflects the fact that $\langle q|n_1\rangle$ receives a disproportional contribution from the
left tail of the distribution where $n_2$ is low. However, not only the actual value is
affected, but also the normalized sensitivity to changes in $n_1$. Even with simple
Poisson fluctuations, the inhibition function above with Hill coefficient of $h = 1$ can
locally behave as if $h = 2$ or higher. That is, it is possible to have

$$-\frac{\partial \ln\langle q|n_1\rangle}{\partial \ln n_1} > 1 \qquad (8.53)$$
The low-copy noise can thus make for a more sharply changing function. This can be
understood as follows: The probability that the next inhibitor event is a birth rather
than death is $\lambda_2 n_1(\lambda_2 n_1 + \mu_2 n_2)^{-1}$. The probability that $n_2$ randomly walks away
from its average $\langle n_2\rangle = \lambda_2 n_1\mu_2^{-1}$ to the lower values where the nonlinear $\langle q|n_1\rangle$
receives a disproportional contribution thus depends on the birth intensity $\lambda_2 n_1$
at every step. The concentration $n_1$ thus affects multiple transitions, and the final
effect is indeed similar to other schemes for multistep sensitivity amplification
(Ehrenberg and Blomberg, 1980; Freter and Savageau, 1980), like kinetic proofreading
(Hopfield, 1974; Ninio, 1975). This principle was called stochastic focusing in
biochemistry (Paulsson et al., 2000; Paulsson and Ehrenberg, 2001). It is a type
of noise-induced transition (Horsthemke and Lefever, 1984) and reminiscent of the
nonlinear effect in the bimolecular rate $r = \lambda n^2$ where $\langle r\rangle = \lambda(\langle n\rangle^2 + \sigma^2)$, as
pointed out early in the literature on stochastic chemistry (Renyi, 1954). If $n$ represents
the protein from the gene expression model in section 8.3 we can tune the
rates of transcription and translation in such ways that the average $\langle n\rangle^2$ goes down
but the variance $\sigma^2$ goes up so much that the sum $\langle n\rangle^2 + \sigma^2$ actually increases.
The average rate of the bimolecular reaction could then go up even if the average
concentration went down. This again means that the effective nonlinearities can be
modulated almost arbitrarily by modulating the fluctuations around an average.
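
The stochastic focusing effect of equations 8.52 and 8.53 can be illustrated with a short numerical sketch. It assumes a Poissonian $P(n_2|n_1)$ whose mean is proportional to $n_1$ and the hyperbolic $q(n_2) = K/(K+n_2)$. The function names, the truncation of the sum, and the parameter values are hypothetical choices made for illustration only.

```python
import math

def poisson_pmf(n, mean):
    """Poisson probability of n, evaluated in log space for stability."""
    return math.exp(n * math.log(mean) - mean - math.lgamma(n + 1))

def q(n2, K):
    """Hyperbolic probability that a replication attempt succeeds."""
    return K / (K + n2)

def mean_q(mean_n2, K, n_max=400):
    """Average of q over a Poisson distribution of n2 (truncated sum)."""
    return sum(q(n, K) * poisson_pmf(n, mean_n2) for n in range(n_max))

def sensitivity(mean_n2, K, eps=1e-5):
    """-d ln<q> / d ln n1; since <n2> is proportional to n1, ln<n2> is varied."""
    up = math.log(mean_q(mean_n2 * math.exp(eps), K))
    down = math.log(mean_q(mean_n2 * math.exp(-eps), K))
    return -(up - down) / (2 * eps)

K, mean_n2 = 0.001, 5.0                       # hypothetical values
averaged = mean_q(mean_n2, K)
mean_field = q(mean_n2, K)
stochastic_sens = sensitivity(mean_n2, K)
mean_field_sens = mean_n2 / (K + mean_n2)     # sensitivity of q itself, below 1
print(averaged, mean_field, stochastic_sens, mean_field_sens)
```

For these values the averaged $q$ is far larger than $q(\langle n_2\rangle)$ because it is dominated by the probability of having no inhibitor at all, and its sensitivity exceeds 1 although the underlying function is hyperbolic.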

So far we only looked at how the underlying fluctuations affect the average value 
of the nonlinear function, but they also affect fluctuations in the same. The effect 
much depends on the relative timescale of fluctuations. If the inhibitor fluctuations
above are fast compared to the lifetime of the plasmid ($\tau_2 \ll \tau_1$), then only the
conditional average of q will have an effect on the plasmid dynamics. This is because 
only persistent fluctuations enslave dependent processes. However, if the inhibitor 
fluctuations are not fast enough, then inhibitor fluctuations can have disastrous 
consequences for the plasmid regulation, drastically widening the distributions. This 
phenomenon is analyzed in more detail in (Paulsson and Ehrenberg, 2000). 

There is also another timescale that is important in this context. The hyperbolic 
function above comes from a condensation of uni- and bimolecular reactions. For 
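many plasmids, the initiation frequency $\lambda_1$ is the rate with which they enter an
intermediate state $I$ from which they then decide to continue with replication or
abortion, according to

$$I \;\begin{cases} \xrightarrow{\;k_r\;} \text{replicate} \\ \xrightarrow{\;k_a n_2\;} \text{abort} \end{cases} \qquad (8.54)$$

The probability for replication is then

$$q = \frac{k_r}{k_r + k_a n_2} = \frac{K}{K+n_2} \qquad (8.55)$$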


The averaging in equation 8.52 thus implies that the inhibitor number $n_2$ remains
constant for the duration of the event. If the inhibitors fluctuate infinitely fast,
the abortion rate would simply be $k_a\langle n_2\rangle$ and the stochastic focusing effect would
disappear (this effect was discussed in more mathematical detail in Paulsson et al.,
2000; Paulsson and Ehrenberg, 2000). For plasmids to exploit stochastic focusing
for noise suppression, there are thus two restrictions on the timescales: Inhibitor
fluctuations must be much slower than the duration of the individual event to
affect the average inhibition (including diffusion), and much faster than plasmid
fluctuations to avoid randomizing the same. Both conditions seem satisfied for
the best characterized plasmids. Plasmids change on a timescale on the order of
hours, inhibitors change on the order of a few minutes, and the chemical decision
in equation 8.54 takes about 10-20 seconds. The inhibitors are small RNAs and
probably capable of diffusing many times across the cell during this time. 

For applications to other systems, it may also be worth mentioning that the 
hyperbolic control function also can arise from 


$$\text{on} \;\underset{k_-}{\overset{k_+ n_a}{\rightleftharpoons}}\; \text{off} \qquad (8.56)$$


where $n_a$ is the number of available repressor molecules. There is a subtle difference
though. Immediately after the repressor has fallen off, there is at least one repressor 
molecule in the system. This affects the distribution of repressor molecules slightly. 
If the repressors are made in a Poisson process and decay exponentially, this subtlety 
is in fact enough to entirely abolish the stochastic focusing effect due to detailed 
balance constraints. However, if the inhibitors display other and perhaps more 
realistic types of fluctuations, for example if they are produced in bursts, then 
stochastic focusing can have large effects on mechanisms like equation 8.56. For a 
more detailed analysis see (Berg et al., 2000b). 

Signal noise can thus in principle be used to make control more regular and 
deterministic—even in the simplest monostable negative feedback systems without 
sensitive bifurcations. It does require a separation of timescales, which rules out 
some candidate mechanisms, but is still a very real possibility in some of the best 
characterized negative feedback systems, like replication control of plasmids R1 or 
ColE1. It can also be used to create bistability in mechanisms that otherwise would 
be doomed to monostability. For example, macroscopic analyses of some mutually 
repressive systems have shown that hyperbolic repression functions (equation 8.55) 
are not sensitive enough to generate bistability (Cherry and Adler, 2000; Gardner 
et al., 2000). That conclusion no longer holds true when spontaneous fluctuations in 
concentrations are taken into account; stochastic focusing can make the hyperbolic 
functions sensitive enough to support bistability. 


8.7.2 Noise-Induced Escape from Macroscopic Attractors 


Some biochemical systems can exhibit distinctly different, self-perpetuating states 
depending on previous stimuli (Angeli et al., 2004; Ferrell Jr., 2002; Monod and 
Jacob, 1961; Ozbudak et al., 2004)—including irreversible developmental switches 
in the cell cycle (Tyson et al., 2001), the maturation of oocytes (Xiong and Fer- 
rell Jr., 2003), the ubiquitous phosphorylation switches in signal transduction path- 
ways (Bhalla et al., 2002), and the lysis-lysogeny decision system of phage lambda 
(Ptashne, 1992). The attractors in these multistable systems are by definition lo- 
cally but not globally stable. A series of random fluctuations—originating in the 
random births and deaths of individual molecules—can thus force the system to 
escape one basin of attraction and allow it to be captured by another (Erdi and 
Toth, 1989; Horsthemke and Lefever, 1984; Kramers, 1940). Depending on the size 
of the fluctuations and the strength of the local stability, the escape rates can be 
arbitrarily low. For example, the probability for phage lambda to spontaneously
switch from a lysogenic to a lytic stage is on the order of $10^{-8}$ per generation
(Aurell and Sneppen, 2002). Seen across large populations, however, this can still 
be a large enough number to have dramatic consequences for the population as a 
whole. In addition to spontaneous escapes from a given attractor, there is also a 
probabilistic initial choice of attractor when an infecting phage commits to either 
lysis or lysogeny, something that has been extensively studied using Monte Carlo 
sampling (Arkin et al., 1998). 

The escape between distant attractors cannot be analyzed by local linearized
models, including FDT, since the escape characteristics mostly depend on what
happens in between the attractors. Analytical approximations can sometimes be
useful to characterize the escape rates (Aurell and Sneppen, 2002), but often 
numerical methods are the only practically useful way to study global dynamics 
of these systems. Straightforward Monte Carlo sampling typically converges too 
slowly for such problems, though, and the full master equation typically has too 
many states for direct integration. An attractive alternative is to approximate the 
master equation by a Fokker-Planck equation (FPE) (Risken, 1984), which is a 
partial differential equation for the time-dependent probability density function. 
The FPE approximation is good when the probability distribution function varies 
smoothly over state space. Since the FPE is non-local it can be used to analyze 
escape from macroscopic attractors (Qian et al., 2002), and it is also suitable for 
numerical integration using the extensively refined methods developed for partial 
differential equations (Ferm et al., 2004). 
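
For orientation, the following sketches what such an approximation looks like in the simplest case. It assumes a single species with copy number $n$, a birth rate $J^+(n)$ and a death rate $J^-(n)$, each event changing $n$ by one molecule; this one-variable setting and the symbol $p(n,t)$ are assumptions made here for illustration, not the multivariate equation integrated in the example below. Truncating the expansion of the master equation after the second-order term gives the standard form

```latex
\frac{\partial p(n,t)}{\partial t}
  = -\frac{\partial}{\partial n}\Big[\big(J^{+}(n) - J^{-}(n)\big)\,p(n,t)\Big]
    + \frac{1}{2}\,\frac{\partial^{2}}{\partial n^{2}}\Big[\big(J^{+}(n) + J^{-}(n)\big)\,p(n,t)\Big]
```

The first term transports probability along the macroscopic flow, while the second spreads it at a rate set by the total event intensity, which is what allows probability to leak out of a locally stable attractor.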

An example of FPE integration for a noise induced escape from a macroscopic 
attractor is illustrated in figure 8.3. In this example a trajectory that escapes the 
macroscopic point-attractor ends up in a limit cycle attractor in a model of a 
circadian oscillator (Vilar et al., 2002). Many organisms have evolved internal clocks 
to keep track of time. These are often based on biochemical oscillators that then 
must be resistant to some environmental and internal cues (Barkai and Leibler, 
2000; Mihalcescu et al., 2004). One possible mechanism for generating a circadian 
oscillator is to use a transcriptional activator protein that promotes both its own 
expression and the expression of a repressor protein which in turn sequesters the 
activator. Vilar et al. present a quantitative model for such a system including 
the activator and repressor proteins, their respective mRNAs, the activity state 
of their promoters, and the activator-repressor complex. The model is reported 
to display regular oscillations in activator activity, even for relatively large internal
fluctuations in the levels of some of the chemical species (Vilar et al., 2002). In 
fact, for some parameters, internal fluctuations can drive the oscillation even if 
the corresponding macroscopic system has a single stable nonoscillating attractor. 
Rather than destroying the regular oscillations, the random fluctuations thus make 
them possible. This is illustrated in figure 8.3.

Figure 8.3 Noise induced oscillations. In A we see an example of the time evolution
of two components of the circadian clock system with repressor (solid) and activator- 
repressor complex (dashed) in a macroscopic (gray) or a stochastic (black) model. The 
macroscopic model settles in a steady state whereas the stochastic model oscillates. In B 
we see the time evolution of the whole probability density function (PDF) as modeled by 
the Fokker-Planck approximation of the master equation. Initially the PDF was localized 
close to the macroscopic attracting stationary state. Throughout the time evolution the 
FPE was adaptively discretized as indicated by the grids (Ferm et al., 2004). The equations 
and parameters are those given for the bivariate RC-system by (Vilar et al., 2002) except 
for $\delta_R = 0.1\,\mathrm{h}^{-1}$.

Notes


1. By homogeneous we mean that each molecule has an equal probability to be 
anywhere in the volume on the timescale of the chemical reactions that change the 
state (see chapter 9). 

2. One component always turns into another in chemical reactions, but in 
condensed descriptions some species are approximated as constant sources or sinks 
of matter and not included as state variables. 


9 Kinetics in Spatially Extended Systems 


Karsten Kruse and Johan Elf 


From cells to tissues and organisms, biological systems display spatially inhomoge- 
neous structures. They result from processes in which the time for the transport 
of proteins across the whole system is long compared to typical reaction times. 
In this chapter, theoretical approaches for describing the dynamics of such sys- 
tems are presented. In the first part, continuum descriptions in terms of partial 
differential equations are discussed. Such a description is appropriate if one is in- 
terested in the dynamics on scales that are large compared to molecular length 
scales as, for example, interaction distances of single molecules. In this context, a 
key concept is that of currents, which account for the transport of particles. Several 
techniques for deriving expressions for currents are discussed. On smaller scales, 
the discrete nature of the molecules cannot be neglected and a stochastic descrip- 
tion is required. In particular, this is the case when a molecule has only a few 
potential reaction partners within the diffusion range. A stochastic description in 
terms of the reaction-diffusion master equation is presented in the second part of 
this chapter. It is a generalization of techniques presented in chapter 8 to account 
for inhomogeneous particle distributions. As will be shown, in the limit of many 
reactants within the diffusion range, the reaction-diffusion master equation is well 
approximated by a continuum description. The different approaches are illustrated 
by application to the Min-system of the bacterium Escherichia coli as well as other 
subcellular systems. 





9.1 Continuum Descriptions 


Continuum theories describe the dynamics of spatially extended systems on scales 
that are large compared to molecular scales (Landau and Lifshitz, 1995). In such 
a description, the discrete nature of the single molecules forming the system is 
neglected. Instead, the state of the system is given in terms of continuous functions 
of space and time, the fields¹. In the simplest case, the fields represent densities,
for example of proteins. In addition, they can represent additional features of the
molecules involved, for example the mean orientation of elongated molecules like 
cytoskeletal filaments. The fields are linked to microscopic representations of the 
system state in terms of individual molecules by local averages. Local averages are 
performed over volume elements that are small compared to the length scales of 
the structures one is interested in, but large enough to contain a sufficient number 
of particles such that spatial fluctuations within a volume element are negligible. 

Example: Min-oscillations. Division of the bacterium Escherichia coli usually 
occurs in the cell center leading to two daughter cells of equal size. Selection 
of the center as the division site is in part achieved by the Min-system which 
consists of three proteins, MinC, MinD, and MinE (de Boer et al., 1989; Bi 
and Lutkenhaus, 1993). While MinC inhibits assembly of the division apparatus 
on the cytoplasmic membrane, MinD and MinE regulate the spatial distribution 
of MinC. Fluorescence microscopy of Min-proteins tagged with green fluorescent 
protein (GFP) has revealed that the distributions of the Min-proteins change 
periodically with time (Raskin and de Boer, 1999b; Hu and Lutkenhaus, 1999; 
Raskin and de Boer, 1999a; Hale et al., 2001). During one half of the period, most 
proteins are localized in one cell half, while during the other half of the period they 
predominantly reside in the opposite cell half. The oscillation periods vary from 
cell to cell and range from 40 seconds to 120 seconds. As a consequence of the 
oscillations, MinC suppresses formation of the division apparatus close to the cell 
poles, but not in the center. The oscillations are generated by MinD and MinE alone, 
while MinC oscillates because it co-localizes with MinD. Over the last few years, 
several continuum descriptions of the Min-protein dynamics have been developed 
(Meinhardt and de Boer, 2001; Howard et al., 2001; Kruse, 2002; Huang et al., 
2003; Drew et al., 2005; Meacci and Kruse, 2005). In these descriptions, the fields 
are given by the surface densities of MinD and MinE on the cytoplasmic membrane 
and the volume densities of MinD and MinE in the cytoplasm. As the distribution 
of MinC is directly related to the distribution of MinD, it is not incorporated. 

In a continuum description, the dynamics of the fields is commonly given by 
partial differential equations². The dynamic equations depend on a number of
phenomenological parameters. While the values of these parameters are determined 
by details of the molecular interactions, the form of the dynamic equations is largely 
independent of these details. Rather, it is imposed by the symmetries displayed by 
the system. For example, the equations must transform correctly if the system is 
rotated³; see section 9.1.3 for further discussion of the role of symmetries. Hence,
on large scales the system’s behavior is independent of most properties of the 
microscopic molecular interactions. 

As an example, consider fluid water. Water molecules are characterized by their 
charge distribution, and their interactions involve dipole-dipole interactions and 
hydrogen bonds. For most practical purposes, however, the flow of water can be 
described by the Navier-Stokes equation (Landau and Lifshitz, 1995). Neglecting 
the very weak compressibility of water, this equation contains only two parameters, 
the water density and the shear viscosity. The same equation also describes the
flow of all other simple fluids which may consist of molecules very different from
water. The implications of these differences for the dynamics on large scales are 
fully captured by distinct numerical values of the phenomenological parameters. 
Therefore, appropriate continuum descriptions of spatially extended systems can be 
obtained from much less information than is needed for microscopic descriptions. 
On a more technical level, continuum descriptions in addition permit the use of the 
powerful methods of differential calculus for analysis. Taken together, these points 
make continuum descriptions an extremely helpful tool to investigate mechanisms 
underlying the formation of spatiotemporal structures. 

In the following, general principles that guide the formulation of continuum de- 
scriptions will be presented. Before continuing with the general discussion, however, 
first the very important class of reaction-diffusion systems is introduced. 


9.1.1 Reaction-Diffusion Systems 


In his groundbreaking paper on the chemical basis of morphogenesis, Turing intro- 
duced the idea that the diffusion of particles together with chemical reactions can 
lead to the formation of spatiotemporal patterns (Turing, 1952). Having biologi- 
cal systems in mind, Turing suggested that these patterns might be at the origin 
of structures in living systems such as the regular arrangement of the tentacles 
of hydra. In fact, the formation of compartments in Drosophila and calcium dy- 
namics in cell aggregates as well as within cells have been successfully described 
using a reaction-diffusion approach (Cross and Hohenberg, 1993; Koch and Mein- 
hardt, 1994; Falcke, 2004). The application of reaction-diffusion systems to describe 
intracellular protein dynamics is a more recent development. 

In a reaction-diffusion system each field represents the density of one particle 
species. The different species can, for example, represent different kinds of molecules 
or different states of one kind of molecule. The reaction terms correspondingly 
describe reactions involving the different molecules or transitions between the 
different states. In their most general form, the dynamics of two interacting species 
is described as 





$$\frac{\partial}{\partial t}c_1(\mathbf{r},t) = D_1\nabla^2 c_1(\mathbf{r},t) + u_1(c_1,c_2) \qquad (9.1)$$

$$\frac{\partial}{\partial t}c_2(\mathbf{r},t) = D_2\nabla^2 c_2(\mathbf{r},t) + u_2(c_1,c_2) \qquad (9.2)$$


Here, $c_i(\mathbf{r},t)$, $i = 1,2$ denotes the densities of the two species at a point $\mathbf{r} = (x,y,z)$
in space⁴ and at time $t$. The operator $\partial/\partial t$ denotes the partial derivative with respect
to time, that is a derivative with respect to time while the space coordinates are
kept constant. The first terms on the right hand sides describe particle diffusion.
The parameters $D_i$ are the respective diffusion constants and $\nabla^2$ is the Laplace
operator. In three spatial dimensions, $\nabla^2 = \partial^2/\partial x^2 + \partial^2/\partial y^2 + \partial^2/\partial z^2$. Here $\partial^2/\partial x^2$
is the second partial derivative with respect to $x$ and so on. The form of the diffusion
term will be derived below. The functions $u_i$ depend on the densities and account
for the reactions in the system.
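
To make the structure of equations 9.1 and 9.2 concrete, the following is a minimal sketch of one explicit time step in one spatial dimension. The discretization (second differences on a uniform grid, reflecting boundaries, forward Euler in time) and all names and numbers are assumptions chosen for brevity; they are not the numerical methods used for the results in this chapter.

```python
def laplacian_1d(c, dx):
    """Second difference on a uniform grid with zero-flux (reflecting) ends."""
    n = len(c)
    out = []
    for i in range(n):
        left = c[i - 1] if i > 0 else c[i]
        right = c[i + 1] if i < n - 1 else c[i]
        out.append((left - 2 * c[i] + right) / dx ** 2)
    return out

def euler_step(c1, c2, D1, D2, u1, u2, dx, dt):
    """One forward Euler step: diffusion plus local reaction terms u1, u2."""
    lap1, lap2 = laplacian_1d(c1, dx), laplacian_1d(c2, dx)
    new1 = [a + dt * (D1 * l + u1(a, b)) for a, b, l in zip(c1, c2, lap1)]
    new2 = [b + dt * (D2 * l + u2(a, b)) for a, b, l in zip(c1, c2, lap2)]
    return new1, new2

# Pure diffusion of an initial spike (reaction terms switched off).
no_reaction = lambda a, b: 0.0
c1 = [0.0] * 10 + [1.0] + [0.0] * 10
c2 = [0.0] * 21
for _ in range(50):
    # dt must stay below dx**2 / (2 * D) for this explicit scheme to be stable
    c1, c2 = euler_step(c1, c2, 1.0, 1.0, no_reaction, no_reaction, dx=1.0, dt=0.2)
print(sum(c1), max(c1))
```

Without reactions the total amount of each species is conserved while the spike spreads, which is a useful sanity check before reaction terms are switched on.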

Example: Min-oscillations. Several models for the dynamics of the Min- 
proteins fall into the class of reaction-diffusion systems (Meinhardt and de Boer, 
2001; Howard et al., 2001; Huang et al., 2003; Drew et al., 2005). There, cytosolic 
MinD and MinE diffuse, while diffusion of membrane-bound proteins is usually 
neglected. The reaction terms account for the exchange of proteins between the 
cytoplasm and the membrane. It was shown in vitro that the ATPase MinD has a 
high affinity for the inner bacterial membrane if ATP is present (Hu et al., 2002). 
Furthermore, for concentrations of MinD exceeding a critical value, filamentous 
MinD aggregates are formed on the membrane. MinE associates with the membrane 
only in the presence of MinD. There, it stimulates hydrolysis of the ATP bound 
to MinD, which eventually drives the proteins off the membrane. These results are 
compatible with the behavior of MinD and MinE in vivo. Several different reaction 
schemes have been developed that incorporate these findings. As an example, 
consider the model proposed by Huang et al. (2003). There, binding of MinD to the 
membrane is assumed to be cooperative, leading to the aggregation of membrane- 
bound MinD. The binding of MinE to the membrane is described by a second order 
process involving the concentration of membrane-bound MinD. On the membrane, 
MinE is assumed to exist only in complexes with MinD. Finally, the release of 
MinDE complexes is described as a first order process. Explicitly, the dynamic 
equations are 


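$$\begin{aligned}
\partial_t c_D &= D_D\,\partial_x^2 c_D - [\omega_D + \omega_{dD}(c_d + c_{de})]\,c_D + \omega_{de}c_{de} && (9.3)\\
\partial_t c_E &= D_E\,\partial_x^2 c_E + \omega_{de}c_{de} - \omega_E c_d c_E && (9.4)\\
\partial_t c_d &= -\omega_E c_d c_E + [\omega_D + \omega_{dD}(c_d + c_{de})]\,c_D && (9.5)\\
\partial_t c_{de} &= -\omega_{de}c_{de} + \omega_E c_d c_E && (9.6)
\end{aligned}$$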


For simplicity, the dynamic equations are given here in one spatial dimension and 
in the limiting case of immediate rebinding of ATP to MinD after it is released 
from the membrane. The distributions of cytosolic MinD and MinE are denoted by
$c_D$ and $c_E$, while $c_d$ and $c_{de}$ denote the densities of membrane-bound MinD and
MinDE complexes, respectively. The remaining parameters denote rate constants 
for the different reactions. Note that in agreement with experimental results, the 
above equations conserve the numbers of MinD and MinE proteins. 
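
The conservation property can be verified directly on the reaction terms. The sketch below assumes that the reaction parts of equations 9.3 to 9.6 consist of cooperative MinD attachment, second-order MinE binding to membrane-bound MinD, and first-order release of MinDE complexes; the function name, argument names and numerical values are hypothetical.

```python
def min_reaction_terms(cD, cE, cd, cde, wD, wdD, wE, wde):
    """Local reaction rates for (cytosolic MinD, cytosolic MinE,
    membrane-bound MinD, membrane-bound MinDE)."""
    attach = (wD + wdD * (cd + cde)) * cD   # cooperative MinD attachment
    bind_e = wE * cd * cE                   # MinE binds membrane-bound MinD
    release = wde * cde                     # MinDE complex leaves the membrane
    return (-attach + release,
            release - bind_e,
            attach - bind_e,
            bind_e - release)

# Hypothetical local densities and rate constants
rates = min_reaction_terms(cD=2.0, cE=1.5, cd=0.7, cde=0.3,
                           wD=0.1, wdD=0.05, wE=0.4, wde=0.8)
total_minD_rate = rates[0] + rates[2] + rates[3]   # MinD in all three forms
total_minE_rate = rates[1] + rates[3]              # MinE in both forms
print(total_minD_rate, total_minE_rate)
```

Both sums vanish up to rounding for any choice of densities and rates, so the reactions only move proteins between compartments; with zero-flux boundaries the diffusion terms do not change the totals either.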

The behavior of a reaction-diffusion system is determined by the values of the 
diffusion constants and of the various rates. An analysis of the dynamic equations 
9.1 and 9.2 usually starts with the identification of spatially homogeneous stationary 
states, $c_i(\mathbf{r},t) = c_i^{(0)}$ for all $\mathbf{r}$ and $t$ (Cross and Hohenberg, 1993). For such a state,
the time and space derivatives appearing in the dynamic equations vanish such
that $u_i(c_1^{(0)}, c_2^{(0)}) = 0$ for $i = 1,2$. Then, the stability of these states with respect
to perturbations is analyzed. The basic idea is the same as for ordinary differential 
equations (chapter 6), but in the present context, the perturbation can depend on 
the space coordinate.

For the stability analysis, the perturbed distribution is written as $c_i = c_i^{(0)} + \delta c_i$. If
the state is stable, the perturbation decays with time. In the opposite case, it grows
and a pattern is formed. Inserting the above expression into the dynamic equations
yields the time-evolution of the perturbation $\delta c_i$. If the initial perturbation is small,
only terms linear in $\delta c_i$ have to be retained; non-linear terms are much smaller and
can therefore be neglected. This leads to

$$\frac{\partial}{\partial t}\begin{pmatrix}\delta c_1\\ \delta c_2\end{pmatrix} = L\begin{pmatrix}\delta c_1\\ \delta c_2\end{pmatrix} \qquad (9.7)$$

Here, the linear operator $L$ is given by

$$L = \begin{pmatrix} D_1\nabla^2 + u_{11} & u_{12}\\ u_{21} & D_2\nabla^2 + u_{22}\end{pmatrix} \qquad (9.8)$$

The constants $u_{ij}$ with $i, j = 1, 2$ are defined as $u_{ij} = \partial u_i/\partial c_j$, where the derivatives
are evaluated at $c_i = c_i^{(0)}$.
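
The linearization step can be written out explicitly. The following is a sketch of the expansion, using the notation $c_i^{(0)}$ for the homogeneous stationary densities and $u_{ij}$ for the partial derivatives of the reaction terms, which is assumed here to match the symbols in the surrounding equations.

```latex
u_i\big(c_1^{(0)} + \delta c_1,\; c_2^{(0)} + \delta c_2\big)
  \approx \underbrace{u_i\big(c_1^{(0)}, c_2^{(0)}\big)}_{=\,0}
  + u_{i1}\,\delta c_1 + u_{i2}\,\delta c_2 ,
\qquad
\nabla^2\big(c_i^{(0)} + \delta c_i\big) = \nabla^2 \delta c_i
\;\;\Longrightarrow\;\;
\frac{\partial}{\partial t}\,\delta c_i
  = D_i \nabla^2 \delta c_i + \sum_{j=1}^{2} u_{ij}\,\delta c_j
```

The zeroth-order reaction term vanishes because the reference state is stationary, and the Laplacian of the homogeneous part vanishes because it does not depend on position.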

To proceed further, the densities are decomposed into eigenmodes of the linear
operator $L$. The effect of $L$ on an eigenmode $\phi$ is to multiply the mode with a
constant $\lambda_\phi$, the corresponding eigenvalue. Explicitly, $L\phi = \lambda_\phi\phi$. For an eigenmode,
the dynamic equation

$$\frac{\partial}{\partial t}\phi = L\phi = \lambda_\phi\phi \qquad (9.9)$$

is readily solved

$$\phi(\mathbf{r},t) = \exp(\lambda_\phi t)\,\phi(\mathbf{r},0) \qquad (9.10)$$

where $\phi(\mathbf{r},0)$ is the initial perturbation at time $t = 0$. If the eigenvalue $\lambda_\phi$ has
a negative real part, the mode will decay exponentially in time. In the opposite 
case, it will grow. The homogeneous state is, in this case, unstable under the 
corresponding perturbation, and a pattern will form. Parameter values for which 
the maximum of the eigenvalues’ real parts becomes positive indicate a bifurcation 
or instability. In the vicinity of an instability, the pattern is well described by the 
unstable eigenmode. If the corresponding eigenvalue is real, the instability leads to 
a spatially nonhomogeneous pattern that is stationary in time. In the opposite case, 
the pattern will oscillate in time (Cross and Hohenberg, 1993). 

For systems of a finite size, the determination of the eigenmodes and eigenvalues 
of the differential operator L is usually difficult. Therefore, an analysis is often 
carried out first assuming an infinite system size. Then, the perturbations can be 
decomposed into Fourier modes $c_{\mathbf{k}}$, with

$$\delta c(\mathbf{r},t) = \int \frac{d^3k}{(2\pi)^3}\, c_{\mathbf{k}}(t)\, e^{i\mathbf{k}\cdot\mathbf{r}} \qquad (9.11)$$


Figure 9.1 Schematic representation of the linear growth rate $\mathrm{Re}\,\lambda_k$ as a function of
the wave number $k$. Graphs in one diagram can be obtained by changing a suitable
phenomenological parameter in the dynamic equations. For the case displayed in a), the
instability occurs at a finite wave number, in b) and c) at $k = 0$, respectively implying a
periodic state and a homogeneous state right after the instability.


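Inserting this expression into equation 9.7, the dynamics of each mode is governed
by a 2 × 2 matrix $L_k$ with

$$L_k = \begin{pmatrix} -D_1k^2 + u_{11} & u_{12}\\ u_{21} & -D_2k^2 + u_{22}\end{pmatrix} \qquad (9.12)$$

The eigenvalues of this matrix are

$$\lambda_k = \frac{1}{2}\left[\mathrm{tr}\,L_k \pm \sqrt{(\mathrm{tr}\,L_k)^2 - 4\det L_k}\right] \qquad (9.13)$$

where $\mathrm{tr}\,L_k = -(D_1 + D_2)k^2 + u_{11} + u_{22}$ is the trace and $\det L_k = (D_1k^2 - u_{11})(D_2k^2 - u_{22}) - u_{12}u_{21}$
is the determinant of $L_k$. Stability requires the real parts of $\lambda_k$ to be
negative for all values of $k$.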
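
The stability criterion can be explored numerically by scanning the growth rate over the wave number. The sketch below evaluates the eigenvalues from the trace and determinant of the 2 × 2 mode matrix; the function names and the parameter set are hypothetical, chosen so that the homogeneous state is stable at $k = 0$ but unstable in a band of finite wave numbers.

```python
import cmath

def eigenvalues(k, D1, D2, u11, u12, u21, u22):
    """Both eigenvalues of the 2 x 2 mode matrix for wave number k."""
    tr = -(D1 + D2) * k ** 2 + u11 + u22
    det = (D1 * k ** 2 - u11) * (D2 * k ** 2 - u22) - u12 * u21
    root = cmath.sqrt(tr ** 2 - 4 * det)     # complex sqrt handles oscillatory modes
    return (tr + root) / 2, (tr - root) / 2

def growth_rate(k, **params):
    """Largest real part of the eigenvalues at wave number k."""
    return max(lam.real for lam in eigenvalues(k, **params))

# Hypothetical parameters: slowly diffusing self-enhancing species 1,
# rapidly diffusing species 2.
params = dict(D1=1.0, D2=20.0, u11=1.0, u12=-1.0, u21=2.0, u22=-1.5)
ks = [0.02 * i for i in range(101)]          # wave numbers from 0 to 2
rates = [growth_rate(k, **params) for k in ks]
k_fastest = ks[rates.index(max(rates))]
print(growth_rate(0.0, **params), k_fastest, max(rates))
```

For this parameter set the fastest growing mode has a finite wave number and a real eigenvalue, which corresponds to a stationary pattern with a characteristic wavelength.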

As a function of the mode number k, the real parts of the eigenvalues will show one 
of the functional behaviors indicated in figure 9.1. At an instability, the wave number 
of the critical mode $k_c$ as well as the imaginary part of the critical mode’s eigenvalue 
can be zero or non-zero. This leads to essentially three classes of patterns that are 
formed at an instability: stationary patterns with a characteristic spatial wavelength 
and oscillatory patterns with or without a characteristic spatial wavelength (Cross 
and Hohenberg, 1993). Beyond the instability, the non-linear terms can no longer be 
neglected. Often one then has to fall back upon numerical solutions of the dynamic 
equations. 

In summary, a linear stability analysis can give a rough understanding of the 
system behavior as a function of the system parameters without explicitly 
solving the dynamic equations. If there are more than two densities involved, an 
analytic calculation of the eigenvalues is in general not possible. In this case, 
however, they are readily obtained numerically. 
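
As a minimal sketch of such a numerical evaluation, the following Python snippet computes the eigenvalues of the matrix in equation 9.12 on a grid of wave numbers and compares them with the closed form of equation 9.13. The function names, the matrix `u`, and the diffusion constants are hypothetical choices made only for illustration; they do not correspond to any system discussed in this chapter.

```python
import numpy as np

def growth_rates(k, D1, D2, u):
    """Eigenvalues of the 2 x 2 matrix L_k (equation 9.12) for each wave number in k."""
    k = np.atleast_1d(np.asarray(k, dtype=float))
    Lk = np.empty((k.size, 2, 2))
    Lk[:, 0, 0] = -D1 * k**2 + u[0, 0]
    Lk[:, 0, 1] = u[0, 1]
    Lk[:, 1, 0] = u[1, 0]
    Lk[:, 1, 1] = -D2 * k**2 + u[1, 1]
    return np.linalg.eigvals(Lk)  # one pair of eigenvalues per wave number

def growth_rates_closed_form(k, D1, D2, u):
    """The same eigenvalues obtained from trace and determinant (equation 9.13)."""
    k = np.atleast_1d(np.asarray(k, dtype=float))
    tr = -(D1 + D2) * k**2 + u[0, 0] + u[1, 1]
    det = (D1 * k**2 - u[0, 0]) * (D2 * k**2 - u[1, 1]) - u[0, 1] * u[1, 0]
    root = np.sqrt((tr**2 - 4.0 * det).astype(complex))
    return np.stack([(tr + root) / 2.0, (tr - root) / 2.0], axis=1)

# Hypothetical linearized reaction terms u_ij and diffusion constants.
u = np.array([[1.0, -1.0], [2.0, -1.5]])
D1, D2 = 1.0, 20.0
k = np.linspace(0.0, 2.0, 201)

max_re = growth_rates(k, D1, D2, u).real.max(axis=1)  # linear growth rate
unstable = k[max_re > 0.0]                            # band of unstable modes
```

With these illustrative values the homogeneous state is stable against homogeneous perturbations (k = 0) but unstable in a band of finite wave numbers, which corresponds to case a) of figure 9.1. For more than two densities only `growth_rates` generalizes, with a larger matrix per wave number.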

Example: Min-oscillations. The homogeneous stationary state of the dynamic 
equations 9.3 to 9.6 is determined by the roots of a polynomial. For each value of 
the parameters it is unique and has to be determined numerically. The eigenvalue 
with the largest real part as a function of the wave-number k is shown in figure 9.2. 




Figure 9.2 Analysis of the dynamic equations 9.3 to 9.6. a) Real (solid line) and 
imaginary (dashed line) part of the eigenvalue with the largest real part as a function 
of the wave number k. In an interval of k-values, the real part is positive. At the same 
time the imaginary part is non-zero in this range, indicating an oscillatory instability. The 
gray dotted line indicates $\mathrm{Re}\,\lambda_k = 0$. b) Space-time plot of the total MinD-distribution 
obtained from a numerical solution of the dynamic equations. The density is color coded 
with brighter gray levels indicating a higher density. MinD periodically shifts from one 
end to the opposite end. 


The real parts are positive in an interval in which their imaginary part does not 
vanish. The instability is thus oscillatory. Correspondingly, a numerical solution of 
the dynamic equations yields a distribution that changes periodically in time. 


9.1.2 Densities and Currents 


The reaction-diffusion equations 9.1 and 9.2 are a special case of a continuity 
equation. The structure of a continuity equation reflects that at a given point in 
space, the number of molecules in a small volume element can change because of two 
possible events. First, particles can be transported into or out of this volume element 
and second, particles can be created or destroyed within the volume element. This 
applies to all conserved quantities (de Groot and Mazur, 1984). 

Transport across the surface of a volume is described by currents. They give 
the number of particles traversing a surface element per unit time. A current j is 
a vector with components $j_i$, where $i$ indicates the three directions $x$, $y$, and $z$. 
Each component $j_i$ is the current through a surface element perpendicular to the 
direction $i$. The net change of particle number in a small volume element is then 
obtained from the divergence of the corresponding current 


$$\nabla \cdot \mathbf{j} = \frac{\partial j_x}{\partial x} + \frac{\partial j_y}{\partial y} + \frac{\partial j_z}{\partial z} \tag{9.14}$$

Figure 9.3 Change in particle number due to particle transport in one dimension. 
The current at $x$ is $j_x(x)$, the current at $x + dx$ is $j_x(x + dx)$. The current is counted 
positive if directed to the right. The net change in particle number $n$ per unit time 
is then the difference between the currents across the left and the right boundary of 
the interval, $dn/dt = j_x(x) - j_x(x + dx)$. Replacing the particle number by the density 
$c$ through $n = c\,dx$ and taking the limit $dx \to 0$ one obtains $\partial c/\partial t = -\partial j/\partial x$. For 
example, in the case of diffusion we have $j = -D\,\partial c/\partial x$, such that the diffusion equation 
is $\partial c/\partial t = D\,\partial^2 c/\partial x^2$. In higher dimensions the contributions of all directions have to be 
summed, leading to expression 9.14. 


See figure 9.3. Here, $\nabla$ is the gradient operator with $\nabla = (\partial/\partial x, \partial/\partial y, \partial/\partial z)$ in 
three dimensions. Creation or destruction of particles is captured by source and 
sink terms s. 

Hence, the continuity equation for the evolution of a particle density has the form 


$$\frac{\partial c}{\partial t} + \nabla \cdot \mathbf{j} = s \tag{9.15}$$


In the reaction-diffusion equations 9.1 and 9.2, the current is a consequence of 
diffusion. It is given by $-D_i \nabla c_i$, whereas the source and sink terms are given by 
the reaction terms $u_i$ with $i = 1, 2$. 

While the source and sink terms are usually given by kinetic equations for the 
reactions taking place in the system, there is no generally applicable procedure or 
framework for deriving expressions for the currents. As will be discussed in section 
9.2, currents can be derived from microscopic descriptions by applying a mean- 
field approximation and then coarse-graining. Other approaches to the currents 
are phenomenological and do not require a microscopic model. At the system 
boundary, additional specifications have to be made. In situations where the system 
is confined by an impenetrable wall, currents across the system boundary have to 
vanish. In other situations, there might be a constant influx into the system, fixing 
the value of the current to a constant value. This would be the case for proteins 
that are generated at a constant rate in a source that is located at the system 
boundary. In more complicated situations, the current at the boundary depends on 
the present state of the system. An example is provided by cell walls containing 
receptor molecules to which proteins can bind. Here, the binding rate will depend 
on the occupancy of the receptors and probably on the presence of other molecules. 
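
To make the roles of currents, sources, and boundary conditions concrete, the following sketch integrates the continuity equation 9.15 in one dimension with a diffusion current. It is only an illustration under stated assumptions: the explicit Euler scheme, the function name `continuity_step`, and all parameter values are our own choices. Currents live on the faces between cells, so whatever leaves one cell enters its neighbor, and the boundary currents `j_left` and `j_right` encode either an impenetrable wall (zero) or a constant influx.

```python
import numpy as np

def continuity_step(c, D, dx, dt, source=None, j_left=0.0, j_right=0.0):
    """One explicit Euler step of dc/dt = -dj/dx + s on a 1D grid of cells."""
    j = np.empty(c.size + 1)               # currents on the cell faces
    j[1:-1] = -D * (c[1:] - c[:-1]) / dx   # diffusion current j = -D dc/dx
    j[0], j[-1] = j_left, j_right          # prescribed boundary currents
    s = 0.0 if source is None else source(c)
    return c + dt * (-(j[1:] - j[:-1]) / dx + s)

# Hypothetical setup: a density bump between two impenetrable walls.
dx, dt, D = 0.1, 0.002, 1.0                # dt < dx**2 / (2 D) for stability
x = (np.arange(100) + 0.5) * dx
c0 = np.exp(-(x - 5.0) ** 2)
c = c0.copy()
for _ in range(500):
    c = continuity_step(c, D, dx, dt)

# A constant influx at the left boundary adds j_left * dt particles per step.
c_influx = continuity_step(c0, D, dx, dt, j_left=1.0)
```

With vanishing boundary currents and no source term the total particle number `c.sum() * dx` is conserved up to rounding, while the bump spreads out. A state-dependent boundary condition, such as binding to receptors, could be sketched by computing `j_left` from `c[0]` before each step.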

Before presenting the phenomenological approach, let us mention again that a 
continuum description will in general also contain fields of nonconserved quantities. 
An example is the orientation of cytoskeletal filaments. Obviously, there are no 
currents associated with such quantities, and their time evolution is consequently 
not given by a continuity equation. However, the same strategies that are used to 
obtain expressions for the currents can also be applied to the rate of change of these 
fields. Therefore, they will not be discussed explicitly in the following. 


9.1.3 Phenomenological Currents 


While there is no generally applicable framework for deriving phenomenological 
expressions for the currents, there are some universal constraints on possible 
expressions. They follow from the symmetries displayed by the system (Nicolis and 
Prigogine, 1977; de Groot and Mazur, 1984; Chaikin and Lubensky, 1995). First 
of all, if a cause is invariant under the action of a spatial symmetry operation like 
rotation or reflection, then the same must be true for any effects due to this cause. 
This condition is expressed by the Curie principle, which furthermore states that if 
an effect is not invariant under a certain symmetry operation, then neither can the 
cause be. One consequence of this principle is, for example, that the directed motion 
of molecular motors is only possible because actin filaments and microtubules are 
polar, that is because they have two different ends. 

A second symmetry that imposes constraints on phenomenological theories is 
the invariance under time-reversal of the microscopic equations of motion. That 
is, even though the expressions in a macroscopic description may not be derived 
from a microscopic description, the universal property of time-reversal invariance 
of the microscopic equations of motion imposes constraints on these expressions. 
This remarkable point was made by Onsager, who showed that for certain systems, 
different phenomenological parameters are intimately related (Onsager, 1931a,b). 

At thermodynamic equilibrium, a system is described by a set of macroscopic 
state variables like temperature, volume, and particle number (Chaikin and Luben- 
sky, 1995). The free energy F is a function of these variables. The equilibrium state 
is the one that minimizes the free energy F, while respecting the constraints im- 
posed on the system. If a constraint is released, the system will evolve towards a new 
equilibrium state that is determined by a minimum in the free energy respecting 
the new constraints. 

The dynamics of systems out of thermodynamic equilibrium, but sufficiently 
close to it, can still be obtained within a generally applicable framework. More 
precisely, it applies to situations where each of the volume elements introduced 
above can be assumed to be in thermodynamic equilibrium. In this case, a free 
energy F of the total system can be defined by summing the free energies associated 
with the individual volume elements. Changes in the system’s free energy can 
then be expressed in terms of products of currents and associated causes, often 
called generalized thermodynamic forces. The currents are naturally functions of 
the generalized forces. In particular, if there are no forces there will be no currents. 
An expansion of the currents in terms of the forces thus starts with the linear terms. 
Keeping only these, one obtains phenomenological dynamic equations (de Groot 
and Mazur, 1984; Chaikin and Lubensky, 1995). The phenomenological parameters 
appearing in the expansion are called linear response coefficients and depend in 
general on the state of the system. 

Example: Diffusion. Consider the case of a single species of noninteracting 
molecules at constant temperature and constant pressure. Then the free energy per 
unit volume f only depends on the particle density c, f = f(c). Consequently, 


$$\begin{aligned}
\frac{d}{dt}F &= \frac{d}{dt}\int d^3r\, f = \int d^3r\, \frac{\partial f}{\partial c}\,\frac{\partial c}{\partial t} \\
&= -\int d^3r\, \mu\, \nabla \cdot \mathbf{j} = \int d^3r\, \mathbf{j} \cdot \nabla\mu
\end{aligned}$$


In this calculation, the continuity equation 9.15 with vanishing source terms has 
been used and the chemical potential $\mu = \partial f/\partial c$ has been introduced. The last 
step is an integration by parts, assuming that the current vanishes at the system 
boundary. The chemical potential gives the change in free energy upon addition of a 
particle to the system. Expressing the current $\mathbf{j}$ in terms of the generalized force 
$\nabla\mu$ one finds in linear order 


$$\mathbf{j} = -\lambda \nabla \mu \tag{9.16}$$


where $\lambda$ is the phenomenological coefficient describing the response of the system, 
that is the current, to a gradient in the chemical potential. Like the free energy, the 
chemical potential is a function of the particle density. Defining $D = \lambda\,\partial\mu/\partial c$, the 
diffusion current can then be cast in the familiar form 


$$\mathbf{j} = -D \nabla c \tag{9.17}$$


where $D$ is the diffusion constant. The minus sign in equation 9.16 has been 
introduced to obtain $D > 0$. For an ideal gas $\mu(c) = k_B T \ln c$, where $k_B$ is the 
Boltzmann constant and $T$ the temperature, so that $D = \lambda k_B T/c$. Since 
$\lambda/c = \xi^{-1}$ is the mobility of a particle, one finds $D\xi = k_B T$, which is the 
well-known Einstein relation (de Groot and Mazur, 1984). 

Example: Molecular motors. As an example of nondiffusive transport con- 
sider the motion of molecular motors (Alberts et al., 2002). These proteins use the 
energy derived from ATP-hydrolysis to move along cytoskeletal filaments. Exam- 
ples are kinesins that transport vesicles along microtubules or myosins that together 
with actin form the contractile machinery in muscles. The motion of motors along 
a filament is directional, where the direction of motion is determined by the 
orientation of the filament. For these motors, the current is given by $\mathbf{j} = \mathbf{v}\,c_{\mathrm{mot}}$, where 
$c_{\mathrm{mot}}$ is the motor density and $\mathbf{v}$ is the average motor velocity. The motion of 
motors is driven by the hydrolysis of ATP, but also by external forces $\mathbf{f}$. Hydrolysis of ATP 
occurs in the presence of a difference $\Delta\mu$ in the chemical potentials of ATP and its 
hydrolysis products ADP and P$_i$. The associated rate is denoted $r$. The change in 
free energy is $\mathbf{f} \cdot \mathbf{v} + r\,\Delta\mu$ and an expansion of $\mathbf{v}$ and $r$ in terms of $\mathbf{f}$ and $\Delta\mu$ yields 
in linear order (Jülicher et al., 1997) 


$$\mathbf{v} = \lambda_{11}\,\mathbf{f} + \boldsymbol{\lambda}_{12}\,\Delta\mu \tag{9.18}$$

$$r = \boldsymbol{\lambda}_{21} \cdot \mathbf{f} + \lambda_{22}\,\Delta\mu \tag{9.19}$$


Note that $\boldsymbol{\lambda}_{12}$ couples a vector quantity and a scalar quantity and must therefore 
itself be a vector, illustrating the Curie principle. As mentioned above, the cross 
coefficients $\boldsymbol{\lambda}_{21}$ and $\boldsymbol{\lambda}_{12}$ are not independent of each other. In fact, the Onsager 
relations impose in the present case $\boldsymbol{\lambda}_{21} = \boldsymbol{\lambda}_{12}$. 

There are situations in which the general framework indicated above is not suffi- 
cient for obtaining appropriate expressions for the currents. First of all, interactions 
with the environment can lead to anomalous diffusion. This would be, for example, 
the case when proteins get trapped in small regions of space with corresponding 
dwell times that are algebraically distributed, that is, the probability of having a 
large dwell time $t$ is proportional to $t^{-(1+\gamma)}$ with $0 < \gamma < 1$. Another example is 
provided by particles moving on DNA that folds back on itself. As the particles 
detach from the DNA, diffuse through three-dimensional space and reattach at a 
different location on the DNA, the effective motion along the one dimensional DNA 
can be anomalous (Berg et al., 1981; Brockmann and Geisel, 2003). Such processes 
can be described in the framework of continuous time random walks (Montroll and 
Shlesinger, 1984). Secondly, systems can be far from thermodynamic equilibrium. 
Even though it is in general hard to measure the distance to thermodynamic equi- 
librium, presumably most people would tend to say that living cells are far away 
from it. Still, one might argue that the theory presented so far should in many 
situations describe the dominant effects. There are, however, situations in which 
linear terms are absent or dominated by non-linear terms such that more general 
expressions are needed. 

Example: Attractive interactions. In the case of an attractive interaction 
between proteins, macroscopic currents are induced by gradients in the protein 
density. In contrast to the case of diffusion, however, the current will be directed 
towards higher concentrations. One might therefore be tempted to describe the 
aggregation process by a diffusion equation with negative diffusion constant. This 
equation is unphysical for several reasons. First of all, it can lead to negative 
densities. Secondly, infinite particle densities can be generated in finite times. 
Finally, it generates structures on arbitrarily fine length scales. All these problems 
can be avoided by modifying the diffusion current such that 


$$j = c\,(c_{\max} - c)\,(k_1 \partial_x c + k_2 \partial_x^3 c) \tag{9.20}$$


with $k_1, k_2 > 0$. For simplicity, it has been assumed in this expression that 
the motion of particles is confined to the $x$-axis. The prefactor $c$ prevents the 
appearance of negative densities, while the prefactor $c_{\max} - c$ prevents the density 
from growing beyond any limit by introducing a maximal density $c_{\max}$. Finally, the 
third-order derivative avoids the formation of structures on arbitrarily small length 
scales. 
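
The role of the third-order derivative can be illustrated by a linear stability analysis of the homogeneous state. The following sketch rests on our own linearization, which is not given in the text: inserting a perturbation proportional to exp(ikx + st) around a homogeneous density c0 into the continuity equation 9.15 without source terms, with the current of equation 9.20, gives the growth rate s(k) = c0 (c_max - c0) k^2 (k1 - k2 k^2). All parameter values below are hypothetical.

```python
import numpy as np

def aggregation_growth_rate(k, c0, c_max, k1, k2):
    """Linear growth rate s(k) of a perturbation exp(i k x + s t) around c0.

    Assumes dc/dt = -dj/dx with the current of equation 9.20 and no reactions.
    """
    return c0 * (c_max - c0) * k**2 * (k1 - k2 * k**2)

# Hypothetical parameter values, chosen only for illustration.
c0, c_max, k1, k2 = 0.3, 1.0, 1.0, 0.25
k = np.linspace(0.0, 3.0, 301)
s = aggregation_growth_rate(k, c0, c_max, k1, k2)

k_cut = np.sqrt(k1 / k2)            # modes with k > k_cut decay
k_fast = np.sqrt(k1 / (2.0 * k2))   # fastest growing mode
```

Under these assumptions, all modes with wave numbers below `k_cut` grow, so the homogeneous state is unstable towards aggregation, while shorter wavelengths are damped by the `k2` term. Setting `k2 = 0` would make the growth rate increase without bound as k grows, which is the formation of structures on arbitrarily fine length scales mentioned above.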
Figure 9.4 Space-time plot of the total MinD distribution on the membrane obtained 
from a numerical solution of the dynamic equations 9.21 and 9.22. Results are shown for a 
system size of $L_0 = 2\,\mu$m (a) and $2L_0$ (b), revealing the period doubling characteristic 
of the Min-oscillations in longer bacteria. The density is color coded with brighter gray 
levels indicating a higher density. Modified from (Meacci and Kruse, 2005). 


Example: Min-oscillations. For the Min-oscillations a mechanism has been 
proposed that is based on an attractive interaction of membrane-bound MinD 
(Kruse, 2002; Meacci and Kruse, 2005). This attraction is assumed to be the 
dominant process for the formation of MinD aggregates. Assuming homogeneous 
cytosolic distributions $c_D$ and $c_E$ of MinD and MinE, respectively, the dynamics 
of the Min-system reduces to the evolution of the densities $c_d$ of MinD and $c_{de}$ of MinDE 
complexes on the membrane. Explicitly, 


$$\partial_t c_d = \omega_D c_D\,(c_{\max} - c_d - c_{de}) - \omega_E c_E c_d - \partial_x j \tag{9.21}$$


$$\partial_t c_{de} = -\omega_{de} c_{de} + \omega_E c_E c_d \tag{9.22}$$


Here, the aggregation current $j$ is chosen to be of the form of equation 9.20. The 
reaction terms describing attachment of MinD and MinE to the membrane and 
detachment of MinDE complexes from the membrane are linear in the membrane 
densities ($c_D$ and $c_E$ are constants). A linear stability analysis of the homogeneous 
state can be performed analytically in this case. It reveals a critical value $k_{1,c}$ 
for the parameter $k_1$ such that the homogeneous state is unstable for $k_1 > k_{1,c}$. 
Furthermore, if the condition $\omega_{de}^2 < \omega_D \omega_E c_D c_E$ is met, then the instability is 
oscillatory and oscillatory solutions reminiscent of the Min-protein oscillations are 
obtained; see figure 9.4. 





9.2 Stochastic Treatment of Nonhomogeneous Chemical Reactions 


As stated above, continuum descriptions are appropriate only if the molecular 
densities are large enough, so that each molecule has many potential reaction 
partners within the diffusion range. If they are too few, then the discrete nature 
of the molecules becomes apparent and a stochastic description is required. In 
the derivation of the master equation in chapter 8 it was assumed that the spatial 
distribution of molecules equilibrates on a shorter time scale than the characteristic 
time scales for changes in the state variables. This was necessary in order to be able 
to take the transition rates $r_j(\mathbf{n})$ as constant for a state $\mathbf{n}$, where the numbers of 
molecules are counted for the whole reaction volume. However, if the molecules do 
not have time to diffuse through the reaction volume between their reactions, the 
rates will not only depend on the total number of different molecules, but also on 
when and where other reactions occurred. The Markovian property of the random 
process is then lost for descriptions that only include the total numbers of molecules 
as state variables. 
The condition for homogeneity by diffusion is that 


$$\tau_i \gg L^2/D_i \quad \text{for all } i = 1 \ldots N \tag{9.23}$$


where $\tau_i$ is the average time between two reactions involving species $i$, $D_i$ is its 
diffusion constant and $L$ is the linear size of the system (Arnold, 1980; Gardiner 
and Steyn-Ross, 1984). When equation 9.23 is satisfied, each molecule has an equal 
probability to have its next reaction anywhere in the reaction volume. Thus the 
local deviations from the spatially averaged concentration induced by localized 
chemical reactions are spread throughout the system, such that the reaction rates 
are homogeneous. When equation 9.23 is not satisfied, the homogeneous master 
equation, used in chapter 8, is an approximation where it is assumed that 
$\langle r_j(\mathbf{n}) \rangle_\Omega \approx r_j(\langle \mathbf{n} \rangle_\Omega)$, that is, the spatially averaged rates that we need in the homogeneous case 
equal the rates evaluated for the total number of molecules. This approximation 
may be good or bad depending on how sensitive the transition rates are to local 
perturbations in concentrations and how the molecules are distributed spatially. 


9.2.1 The Reaction-Diffusion Master Equation 


One way to model spatial heterogeneity is to introduce local concentrations. Here 
we do this by dividing the total volume $\Omega$ into $C$ artificial cubic subvolumes of 
volume $\Delta = \Omega/C$ and by keeping track of how many molecules there are in each 
subvolume. The side length of the subvolumes, $\ell$, is chosen so that equation 9.23 is 
satisfied if $L$ is replaced by $\ell$. With this choice of subvolume size the mean reaction 
free path, the Kuramoto length (Kuramoto, 1974), is longer than a subvolume, 
and the spatial distribution of molecules within a subvolume can be considered 
homogeneous on the time scale of the chemical reactions. 

At the same time, the subvolumes must be much larger than the mean free path. 
This is necessary to describe the movement in a subvolume as a diffusion process. 
The mean collision free path is very short in cells due to the high concentration of 
non-reactive molecules, for example solvents. This typically makes more detailed 
descriptions, including the velocity of the molecules, unnecessary, since the velocity 
distribution equilibrates on the time scale of nonreactive collisions with the solvent 
molecules. 
Finally, the length of the subvolumes, $\ell$, must also be significantly larger than the 
reaction radii (Berg, 1978b; Ovchinnikov et al., 1989) of all interactions, which 
for biomolecules can be a more demanding requirement than the mean free path. 
This is required for well-defined association and dissociation rate constants within 
each subvolume (Elf and Ehrenberg, 2004). If the reaction subvolumes are made 
smaller than the reaction radii, molecules have a hard time finding each other, but 
when they do they never let go. 
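
The three requirements on the subvolume size can be collected in a small helper. This is only a sketch: the function name, the dictionary keys, and in particular the safety `factor` that turns "much larger" into a concrete inequality are our own assumptions, and the numerical values in the example are hypothetical.

```python
def check_subvolume_size(ell, D, tau, mean_free_path, reaction_radius, factor=10.0):
    """Check the conditions on the subvolume side length ell (all in SI units).

    ell:             side length of a subvolume (m)
    D:               diffusion constant of the species (m^2/s)
    tau:             average time between two reactions of the species (s)
    mean_free_path:  mean collision free path (m)
    reaction_radius: largest reaction radius of the interactions (m)
    factor:          hypothetical margin quantifying "much larger"
    """
    return {
        # Equation 9.23 with L replaced by ell: diffusion mixes a subvolume
        # before the next reaction takes place.
        "well_mixed": tau >= factor * ell**2 / D,
        # Motion within a subvolume can be described as diffusion.
        "diffusive": ell >= factor * mean_free_path,
        # Association and dissociation rate constants are well defined.
        "resolves_reactions": ell >= factor * reaction_radius,
    }

# Hypothetical values: D = 1e-12 m^2/s corresponds to 1e-8 cm^2/s.
ok = check_subvolume_size(ell=1e-7, D=1e-12, tau=1.0,
                          mean_free_path=1e-10, reaction_radius=5e-9)
too_small = check_subvolume_size(ell=1e-8, D=1e-12, tau=1.0,
                                 mean_free_path=1e-10, reaction_radius=5e-9)
```

In this example a side length of 0.1 micrometers passes all three checks, whereas 10 nanometers is too close to the assumed reaction radius. The first condition bounds the subvolume size from above, the other two from below, so for fast reactions there may be no admissible size at all.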

The extended state description is $\{\mathbf{n}\} = \{\mathbf{n}_1, \ldots, \mathbf{n}_\kappa, \ldots, \mathbf{n}_C\}$, where 
$\mathbf{n}_\kappa = \{n_{1\kappa}, \ldots, n_{i\kappa}, \ldots, n_{N\kappa}\}$ and $n_{i\kappa}$ is the number of $i$ molecules in subvolume $\kappa$. The 
state of the system is changed by chemical reactions within the subvolumes and 
diffusion events between the subvolumes. The chemical reactions have different 
rates in different subvolumes since they depend on the local concentrations of 
reactants $\Delta^{-1}\mathbf{n}_\kappa$. The probability that a reaction $j$ will occur in subvolume $\kappa$ during 
the infinitesimal time between $t$ and $t + dt$ is $dt \cdot r_j(\Delta^{-1}\mathbf{n}_\kappa)$. If this reaction occurs, 
the local state is changed from $\mathbf{n}_\kappa$ to $\mathbf{n}_\kappa + \boldsymbol{\nu}_j$. 

Diffusion is modeled as a memory-lacking random walk in discrete space, as 
implemented by a set of first order diffusion events: 


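$$\{\ldots, n_{i\gamma}, \ldots, n_{i\kappa}, \ldots\} \xrightarrow{\; d_i^{\kappa\gamma} n_{i\kappa} \;} \{\ldots, n_{i\gamma} + 1, \ldots, n_{i\kappa} - 1, \ldots\}, \qquad \kappa, \gamma = 1, 2, \ldots, C \tag{9.24}$$


Here, an $i$-molecule diffuses from subvolume $\kappa$ to subvolume $\gamma$. The first order 
diffusion rate constant for species $i$ is taken to be $d_i^{\kappa\gamma} = d_i^{\gamma\kappa} = D_i/\ell^2$ for neighboring 
subvolumes and otherwise zero. This implies that the probability that an $i$-molecule 
diffuses from subvolume $\kappa$ to its neighbor $\lambda$ during the infinitesimal time between 
$t$ and $t + dt$ is $dt \cdot d_i^{\kappa\lambda} n_{i\kappa}$. 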

Given the extended state description and a set of state transition rates for the 
local reaction and diffusion events, we can write down our reaction-diffusion master 
equation (RDME) as in chapter 8 (Kuramoto, 1974; Gardiner et al., 1976; Nicolis 
and Prigogine, 1977; Baras and Mansour, 1997): 


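$$\begin{aligned}
\frac{dP(\{\mathbf{n}\},t)}{dt} = {} & \sum_{\kappa}\sum_{j}\Big( r_j\big((\mathbf{n}_\kappa - \boldsymbol{\nu}_j)\Delta^{-1}\big)\, P(\{\ldots, \mathbf{n}_\kappa - \boldsymbol{\nu}_j, \ldots\}, t) - r_j\big(\mathbf{n}_\kappa\Delta^{-1}\big)\, P(\{\mathbf{n}\},t) \Big) \\
& + \sum_{i}\sum_{\kappa}\sum_{\gamma}\Big( d_i^{\gamma\kappa}\,(n_{i\gamma}+1)\, P(\{\ldots, n_{i\gamma}+1, \ldots, n_{i\kappa}-1, \ldots\}, t) - d_i^{\kappa\gamma}\, n_{i\kappa}\, P(\{\mathbf{n}\},t) \Big)
\end{aligned} \tag{9.25}$$
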
The upper row contains the state transition rates that are due to reactions (j = 
1,..., M). For each subvolume these rates are calculated for the local concentration 
of molecules. The lower row contains the terms for diffusion between neighboring 
subvolumes. Here, $\{\ldots, n_{i\gamma}+1, \ldots, n_{i\kappa}-1, \ldots\}$ is the state that differs from 
$\{\mathbf{n}\}$ by having $n_{i\gamma}+1$ $i$ molecules in subvolume $\gamma$ and $n_{i\kappa}-1$ $i$ molecules in subvolume $\kappa$. 
Figure 9.5 illustrates the principles from the perspective of the single molecule. 
Let us mention that in cases where the microscopic transport of particles is not only 
due to diffusion, generalizations of the RDME can be used. 







Figure 9.5 Example of how chemical reactions are modeled by the Reaction-Diffusion 
Master Equation, illustrated in two spatial dimensions. The probability that the black 
molecule jumps to one of the neighboring subvolumes in the next infinitesimal time $dt$ is 
$dt \times 4 \times D/\ell^2$, where $D$ is the diffusion constant of the black molecule, $\ell$ is the side length of 
the subvolume, and 4 is the number of neighbors. The probability that the black molecule 
instead binds one of the two white molecules is $dt \times \Delta \times k_a \times 1/\Delta \times 2/\Delta$, where $k_a$ is the 
association rate constant, $\Delta$ is the volume of the subvolume, and $1/\Delta$ and $2/\Delta$ are the 
respective concentrations of black and white molecules in the subvolume. 
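
The two event rates of figure 9.5 can be evaluated numerically. The sketch below is only an illustration: the function names are our own, and the parameter values (a subvolume of 10^-18 liters, a diffusion constant of 10^-8 cm^2/s, and an association rate constant of 10^8 per molar per second) are assumptions chosen to be of a biologically plausible order of magnitude. A rate constant given in molar units is converted to a volume per molecule and time by dividing by the Avogadro constant.

```python
N_A = 6.02214076e23  # Avogadro constant (1/mol)

def jump_rate(D, ell, n_neighbors):
    """Rate (1/s) at which one molecule leaves its subvolume: n_neighbors * D / ell^2."""
    return n_neighbors * D / ell**2

def association_rate(k_a_molar, volume_liters, n_black, n_white):
    """Rate (1/s) of binding events in one subvolume: Delta * k_a * c_black * c_white."""
    k_a = k_a_molar / N_A                  # liters per molecule per second
    c_black = n_black / volume_liters      # molecules per liter
    c_white = n_white / volume_liters
    return volume_liters * k_a * c_black * c_white

# Assumed values: ell = 0.1 micrometers, so Delta = 1e-21 m^3 = 1e-18 liters.
ell = 1e-7     # m
D = 1e-12      # m^2/s, that is 1e-8 cm^2/s
r_jump = jump_rate(D, ell, n_neighbors=4)
r_bind = association_rate(1e8, 1e-18, n_black=1, n_white=2)
```

With these assumed values both rates are a few hundred events per second, so diffusion between subvolumes and binding within a subvolume compete on the same time scale. Multiplying either rate by an infinitesimal time `dt` gives the corresponding probability quoted in the caption.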


Macroscopic Currents from the Reaction-Diffusion Master Equation 

In the limit that there are macroscopically large numbers of reaction partners within 
the diffusion range of each molecule, the RDME converges to the macroscopic 
reaction-diffusion equation introduced in section 9.1.1 (Arnold and Theodosopulu, 
1980). 

To get an intuitive idea in one spatial dimension about how to reach this result, 
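consider the average change in the number, $n_{i\kappa}$, of $i$ molecules in subvolume $\kappa$ 
between $t$ and $t + dt$, under the assumption that the system is in state $\{\mathbf{n}\}$ at time $t$: 

$$\langle dn_{i\kappa} \rangle = \sum_j \nu_{ij}\left[dt \cdot r_j(\mathbf{n}_\kappa\Delta^{-1})\right] + 1\left[\frac{D_i\, n_{i(\kappa-1)}\, dt}{\ell^2}\right] - 2\left[\frac{D_i\, n_{i\kappa}\, dt}{\ell^2}\right] + 1\left[\frac{D_i\, n_{i(\kappa+1)}\, dt}{\ell^2}\right] \tag{9.26}$$

The stoichiometries of change in $n_{i\kappa}$ for the different events are here weighted 
by the probabilities of the different events, given in brackets. Assuming that there are many 
molecules in each subvolume such that the molecule copy number distribution in 
each subvolume is well represented by the average concentration $c_{i\kappa} = \langle n_{i\kappa} \rangle \Delta^{-1}$ 
and $\langle r_j(\mathbf{n}_\kappa\Delta^{-1}) \rangle \approx r_j(\mathbf{c}_\kappa)$, we can approximate 
$\langle n_{i\kappa}(t + dt) \rangle = \langle n_{i\kappa}(t) \rangle + \langle dn_{i\kappa}(t) \rangle$ by 


$$c_{i\kappa}(t + dt) = c_{i\kappa}(t) + dt\,\Delta^{-1} \sum_j \nu_{ij}\, r_j(\mathbf{c}_\kappa) + dt\, D_i\, \frac{c_{i(\kappa-1)} - 2c_{i\kappa} + c_{i(\kappa+1)}}{\ell^2} \tag{9.27}$$
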
Further, if we recognize that the concentrations vary smoothly between the 
subvolumes due to the constraint that neighbors should be in diffusion equilibrium, 
equation 9.27 can be rewritten as 


$$\frac{\partial c_i(x,t)}{\partial t} = \Delta^{-1} \sum_j \nu_{ij}\, r_j(\mathbf{c}(x,t)) + D_i\, \frac{\partial^2 c_i(x,t)}{\partial x^2} \tag{9.28}$$


where $c_i(x,t)$ is the average concentration of species $i$ at position $x$ at time $t$. 
This approach can also be used in connection with a generalized RDME mentioned 
above. Then, the currents obtained in the high density limit differ in general from 
the diffusion current (see Bollenbach et al. (2005) for an example). 
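
The step from the discrete update of equation 9.27 to the continuum equation 9.28 can be checked numerically for the diffusion term alone. The sketch below is an illustration under our own assumptions: reactions are switched off, the subvolumes form a ring (periodic boundaries), and the system size, diffusion constant, and time step are hypothetical. A single cosine mode with wave number q should then decay as exp(-D q^2 t), as predicted by the diffusion equation.

```python
import numpy as np

def diffusion_update(c, D, ell, dt):
    """One step of equation 9.27 without reactions, on a ring of subvolumes."""
    return c + dt * D * (np.roll(c, 1) - 2.0 * c + np.roll(c, -1)) / ell**2

# Hypothetical setup: 200 subvolumes on a ring of length L.
C, L, D = 200, 1.0, 1.0
ell = L / C
x = (np.arange(C) + 0.5) * ell
q = 2.0 * np.pi / L
c = 1.0 + 0.1 * np.cos(q * x)          # smooth initial concentration profile

dt, steps = 0.1 * ell**2 / D, 10_000   # dt well below ell^2 / (2 D)
for _ in range(steps):
    c = diffusion_update(c, D, ell, dt)

# Amplitude of the cosine mode after the simulation, by projection.
amplitude = 2.0 * np.mean((c - 1.0) * np.cos(q * x))
# Prediction of the continuum diffusion equation.
expected = 0.1 * np.exp(-D * q**2 * steps * dt)
```

The agreement improves as the profile varies more smoothly on the scale of one subvolume, that is, as the product of `q` and `ell` decreases, which is the smoothness assumption made in the text.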

In the limit of fast diffusion the RDME converges to the “ordinary” chemical 
master equation addressed in chapter 8 (Kuramoto, 1974; Gardiner and Steyn- 
Ross, 1984). 


9.2.2 Sampling the Markov Process of the Reaction-Diffusion Master 
Equation 


The RDME is too complicated for analytical approaches, especially if the system 
displays “exotic” properties, such as bi-stability, ultra-sensitivity, oscillations, spatial 
pattern formation, etc. As an alternative, it is possible to sample the Markov 
process one event at a time using an appropriate Monte Carlo method. For one- 
dimensional systems such simulations were pioneered in 1979 (Malek-Mansour and 
Houard, 1979) using the SSA (Gillespie, 1976), see chapter 8 and chapter 16. For 
three-dimensional systems, where the number of possible events is astronomical, 
a number of algorithmic improvements are required. See for instance Fricke and 
Wendt (1992); Hanusse and Blanché (1981). 

The most recent improvements were made with the next subvolume method (Elf 
and Ehrenberg, 2004), which is an adaptation of Gillespie’s SSA (Gillespie, 1976) 
and the next reaction method (Gibson and Bruck, 2000a) to the special structure 
of the RDME. The starting point is to sum the rates of all reaction and diffusion 
events in each subvolume. Let us call these sums $r_\kappa$ for the different subvolumes $\kappa$. 
The time to the next event in each subvolume is then exponentially distributed with 
an average of $1/r_\kappa$. Given an initial distribution of molecules, the time for the first 
event in each subvolume is sampled from the respective exponential distribution, 
and the subvolumes are ordered in a priority queue according to when the events are 
scheduled to appear. The first event in the whole system occurs in the subvolume 
on top of the priority queue. The event that actually occurs in this subvolume is 
sampled in proportion to the rates of the different events that can occur in this 
subvolume. If it is a reaction event, it will change the state of the subvolume, which 
means that some of the rates for the events in the subvolume must be recalculated. 
Further, a new time for the next event in this subvolume must be sampled, and the 
corresponding element in the priority queue is sorted accordingly. If it is a diffusion 
event, two subvolumes are involved in the event and two elements in the priority 
queue must be reordered according to their new event times. The next subvolume 
method can be used to simulate systems with millions of subvolumes and molecules. 
It is implemented in the software tools MesoRD (Hattne et al., 2005) and SmartCell 
(Ander et al., 2004). 
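
The steps described above can be sketched in a few lines of Python for the annihilation reaction A + B -> 0 on a one-dimensional ring of subvolumes. This is a simplified illustration and not the implementation used in MesoRD or SmartCell. Two shortcuts are assumptions of the sketch: instead of re-sorting elements inside the priority queue, outdated entries are marked by a version counter and skipped when they reach the top, and the waiting time of every affected subvolume is sampled afresh, which is statistically valid because exponential waiting times are memoryless. The function name and the parameters `d` (jump rate per molecule and neighbor, D/l^2) and `q` (reaction rate per pair of molecules in a subvolume) are hypothetical.

```python
import heapq
import random

def next_subvolume_1d(nA, nB, d, q, t_end, seed=0):
    """Simplified next subvolume method for A + B -> 0 on a ring of subvolumes."""
    rng = random.Random(seed)
    nA, nB = list(nA), list(nB)
    C = len(nA)
    version = [0] * C   # increases whenever a subvolume is rescheduled
    queue = []          # entries: (time of next event, subvolume, version)

    def rates(k):
        # Reaction, diffusion of A, diffusion of B (two neighbors each).
        return q * nA[k] * nB[k], 2.0 * d * nA[k], 2.0 * d * nB[k]

    def schedule(k, t):
        version[k] += 1                     # invalidates older queue entries
        total = sum(rates(k))
        if total > 0.0:
            heapq.heappush(queue, (t + rng.expovariate(total), k, version[k]))

    for k in range(C):
        schedule(k, 0.0)

    while queue:
        t, k, ver = heapq.heappop(queue)
        if ver != version[k]:
            continue                        # outdated entry, skip it
        if t > t_end:
            break
        r_react, r_A, r_B = rates(k)
        u = rng.random() * (r_react + r_A + r_B)
        if u < r_react or r_A + r_B == 0.0:
            nA[k] -= 1                      # reaction event: A + B -> 0
            nB[k] -= 1
            schedule(k, t)
        else:
            species = nA if (u < r_react + r_A or r_B == 0.0) else nB
            dest = (k + rng.choice((-1, 1))) % C
            species[k] -= 1                 # diffusion event to a neighbor
            species[dest] += 1
            schedule(k, t)                  # two subvolumes are affected
            schedule(dest, t)
    return nA, nB

# Hypothetical example: 20 subvolumes with 5 A and 5 B molecules each.
a, b = next_subvolume_1d([5] * 20, [5] * 20, d=1.0, q=0.5, t_end=5.0, seed=1)
```

Because each event touches at most two subvolumes, only their queue entries need to be renewed, which is what makes the method efficient for large numbers of subvolumes. An indexed priority queue, as in the original method, avoids the accumulation of outdated entries that the version counter tolerates here.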

An alternative to the RDME approach to spatially dependent stochastic kinetics 
is to use particle-based simulation methods. These are usually discretized in time 
instead of space. The Brownian motion of the molecules is sampled at fixed time 
intervals, assuming that molecule displacements in space during the time interval 
follow a Gaussian distribution. Depending on the positions of the molecules in space, 
it is decided if nearby molecules have reacted or not during the last time interval. 
The available software tools (MCell (Stiles et al., 1998) and SmolDyn (Andrews 
and Bray, 2004)) make this decision in different ways. As a point of reference to the 
RDME treatment, one can consider the case where the time step is chosen as the 
mean time to the next diffusion event in the RDME description, that is $\Delta t = \ell^2/2D$ 
in one dimension. In this case the root mean square (RMS) displacement during the 
time step equals the length of one subvolume. If, in addition, the reaction probability 
during this time step is calculated from the local concentration within a radius equal 
to the RMS displacement, the particle-based and RDME-based methods are very similar. Another 
algorithm that should be mentioned in this context is the Green’s Function Reaction 
Dynamics algorithm (van Zon and ten Wolde, 2005). The GFRD can be used for 
very detailed reaction-diffusion simulations, since it is discretized neither in time 
nor in space. 
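
The correspondence between the time step and the subvolume length can be verified by sampling. In the following sketch, Gaussian displacements with variance 2 D dt are drawn for a time step of l^2/(2D) in one dimension, and their root mean square is compared with the subvolume length l. The numerical values of the diffusion constant and the subvolume length are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(1)

D = 1e-12                 # m^2/s (hypothetical, equal to 1e-8 cm^2/s)
ell = 1e-7                # m, side length of an RDME subvolume (hypothetical)
dt = ell**2 / (2.0 * D)   # mean time to the next diffusion event in 1D

# Brownian displacements during dt are Gaussian with variance 2 D dt.
steps = rng.normal(0.0, np.sqrt(2.0 * D * dt), size=200_000)
rms = np.sqrt(np.mean(steps**2))
```

The sampled `rms` agrees with `ell` up to statistical error, since the variance 2 D dt equals l^2 for this choice of time step.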


9.2.3 Examples 


Annihilation Kinetics $A + B \to \emptyset$ 

A simple example illustrates how spatial fluctuations can change the kinetics for 
very simple reaction schemes. Consider the reaction $A + B \xrightarrow{k} \emptyset$, with initial 
concentrations $n_1(0)/\Omega = n_2(0)/\Omega = 10\,\mu$M, $k = 10^{8}\,$M$^{-1}$s$^{-1}$, and $D = 10^{-8}\,$cm$^2$s$^{-1}$. 
The molecules are randomly (that is, uniformly) distributed in a volume of $\Omega = 10^{-12}$ 
liters divided into $10^6$ cubic subvolumes of $10^{-18}$ liters. 

Figure 9.6 shows the decay in the numbers of A and B molecules when the 
subvolumes are distributed in one, two, or three spatial dimensions. These decay 
rates should be compared to the corresponding (mean-field) deterministic reaction- 
diffusion description, where the geometry of the reaction volume does not matter if 
the initial distribution of molecules is uniform. The RDME treatment shows that 
the molecules disappear much more slowly than would be deduced from the 
mean-field model. This is due to inevitable concentration imbalances in the system, 
where, for instance, regions with more A than B molecules will consume all B 
molecules. Once regions dominated by one of the species have been established, 
further reactions can only occur at interfaces between A and B regions, which 
makes diffusion of molecules to the interfaces limiting for the rate of annihilation. 
For this simple reaction scheme, the simulated data can be compared to analytical 
work on the corresponding RDME using renormalization group methods (Lee and 
Cardy, 1995), which demonstrate that $\langle n_1 \rangle = \langle n_2 \rangle \propto t^{-d/4}$, where $d$ is the dimension 
of the system. 
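
For comparison, the mean-field prediction can be worked out in closed form. With equal initial concentrations c0 of A and B, the rate equation dc/dt = -k c^2 has the solution c(t) = c0 / (1 + k c0 t), which decays as 1/t at long times. The sketch below evaluates this decay exponent and the initial molecule numbers. It assumes an initial concentration of 10 micromolar, a rate constant of 10^8 per molar per second, and a volume of 10^-12 liters divided into 10^6 subvolumes; these values are our reading of the parameters quoted above and should be treated as assumptions.

```python
import math

N_A = 6.02214076e23   # Avogadro constant (1/mol)

c0 = 10e-6            # M, assumed initial concentration of A and of B
k = 1e8               # 1/(M s), assumed annihilation rate constant
volume = 1e-12        # liters, assumed total volume
n_subvolumes = 10**6

molecules = c0 * volume * N_A            # initial copy number of each species
per_subvolume = molecules / n_subvolumes

def mean_field(t):
    """Solution of dc/dt = -k c^2 with c(0) = c0."""
    return c0 / (1.0 + k * c0 * t)

def loglog_slope(f, t1, t2):
    """Effective decay exponent of f between times t1 and t2."""
    return math.log(f(t2) / f(t1)) / math.log(t2 / t1)

mean_field_exponent = loglog_slope(mean_field, 1.0, 10.0)
rdme_exponents = {d: -d / 4 for d in (1, 2, 3)}   # from the t^(-d/4) scaling
```

Under these assumptions each subvolume initially holds only about six molecules of each species, so relative fluctuations are large. The mean-field exponent is close to -1, whereas the scaling quoted above gives -1/4, -1/2, and -3/4 in one, two, and three dimensions.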


[Plot for figure 9.6: $n_1(t) = n_2(t)$ versus time (s).] 


Figure 9.6 Geometry effect on the rate of annihilation in the $A + B \to \emptyset$ reaction. The 
simulation was run in MesoRD. 


Noise Induced Domain Separation in Bistable Systems 

Noise induced transitions were discussed in chapter 8. Here we consider additional 
noise induced properties that can arise in spatially extended systems. In particular, 
we will use a simple bistable system built on the double negative feedback principle 
(figure 9.7) to illustrate how internal fluctuations and slow diffusion can change the 
escape properties. The system can be either in a state where the $E_A$ enzymes make
a lot of A molecules that can bind and inhibit the $E_B$ enzyme (dark grey ellipse) or
in the state where the $E_B$ enzymes make a lot of B molecules that can bind and inhibit
the $E_A$ enzymes (light grey ellipse).


Figure 9.7 Double negative feedback schemes. $E_A$ makes A and $E_B$ makes B. A inhibits
$E_B$ and B inhibits $E_A$.


In a macroscopic analysis of the system (homogeneous or inhomogeneous) the 
system goes to one of the two attractors and stays there. However, when we consider 
the fluctuations, there is always a chance that the system escapes from one attractor 
to the other in a noise induced transition. Such escape problems have been studied
for a long time in homogeneous systems (Erdi and Toth, 1989; Horsthemke and
Lefever, 1984). As a homogeneous system gets larger, the average
escape time from an attractor gets longer. The escape time increases
approximately exponentially with the volume of the system, because an escape requires
an increasing number of unlikely events to occur in sequence. For the double negative
feedback system, this exponential dependence of the escape time on the volume is
illustrated by the solid line in figure 9.8a.








[Figure 9.8a: correlation time (s) plotted against cube side length (µm).]


Figure 9.8 Reduction of escape time and domain separation. a) The correlation time
for the number of A molecules in the double negative feedback system is plotted as a
function of the linear extension of the cube shaped system. (The correlation time is the
time, $\tau$, at which the normalized autocorrelation function $\langle n_A(t)\, n_A(t+\tau) \rangle / \langle n_A \rangle^{2} - 1$ has
decreased to $e^{-1}$ of its value at $\tau = 0$. The correlation time is one half of the average time
of escape from one of the attractors in a symmetric bistable system.) Insets: Examples
of the time evolution of the total number of free A and B molecules are given for the points
indicated by arrows. (The figure is reproduced from Elf and Ehrenberg, 2004.) b) The
black and white circles correspond to free A and B molecules, diffusing together with
all the other reactants in a sphere with radius 4 µm. The sphere is divided into 268,096
subvolumes, each of size $(0.1\,\mu\mathrm{m})^{3}$. The rate of diffusion of all components is $d = 2\cdot 10^{-°}\ \mathrm{cm^{2}\,s^{-1}}$.
The simulations were done with the next subvolume method using MesoRD.
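
The definition of the correlation time given in this caption translates directly into a few lines of code. The sketch below is a minimal illustration, assuming an evenly sampled time series `n` with sampling step `dt`; the function names are hypothetical, and `synthetic_series` is a stand-in for simulation output (an AR(1) process with a known correlation time), not data from the figure.

```python
import numpy as np

def normalized_autocorrelation(n, max_lag):
    """C(lag) = <n(t) n(t+lag)> / <n>^2 - 1 for lag = 0..max_lag (in samples)."""
    n = np.asarray(n, dtype=float)
    mean_sq = n.mean() ** 2
    c = np.empty(max_lag + 1)
    for lag in range(max_lag + 1):
        c[lag] = np.mean(n[: n.size - lag] * n[lag:]) / mean_sq - 1.0
    return c

def correlation_time(n, dt, max_lag):
    """First lag time at which C has dropped to 1/e of C(0); None if it never does."""
    c = normalized_autocorrelation(n, max_lag)
    below = np.nonzero(c <= c[0] / np.e)[0]
    return None if below.size == 0 else float(below[0] * dt)

def synthetic_series(n_steps, tau_steps, mean=100.0, sigma=10.0, seed=0):
    """Hypothetical test data: AR(1) noise around `mean` with correlation time tau_steps."""
    rng = np.random.default_rng(seed)
    a = np.exp(-1.0 / tau_steps)
    kicks = rng.normal(0.0, sigma * np.sqrt(1.0 - a * a), size=n_steps)
    y = np.empty(n_steps)
    y[0] = 0.0
    for i in range(1, n_steps):
        y[i] = a * y[i - 1] + kicks[i]
    return mean + y

if __name__ == "__main__":
    series = synthetic_series(n_steps=100_000, tau_steps=50)
    print(correlation_time(series, dt=0.1, max_lag=500))
```

For a symmetric bistable system, doubling the value returned by `correlation_time` would give an estimate of the average escape time, following the relation stated in the caption.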


When slow diffusion of the reactants is considered, such that the molecules do 
not have time to diffuse through the whole volume between two reactions, the 
escape from an attractor is faster than in the homogeneous case (Elf and Ehrenberg, 
2004). In figure 9.8a this is seen in the reduced correlation times for lower diffusion 
constants with the same size of reaction volume. 

At very low diffusion rates the correlation time even levels off and becomes 
independent of further increase of the system size. Systems at the plateau display 
domain separation, where different parts of the volume are in different attractors. 
An example of domain separation of the double negative feedback system is seen
in figure 9.8b, where all reactants are freely diffusing in a sphere of radius 4 µm.
Not all bistable systems will display domain separation when the reactants are 
freely diffusing in three dimensions. However, if diffusion is geometrically obstructed 
such that the number of likely reaction partners is low, the spatial aspects of the 
stochastic bistable systems become important (Bhalla, 2004; Elf and Ehrenberg, 
2004). 

Min-oscillations

To give an example of how the Min-system behaves in a stochastic setting, the
model by Huang et al. (2003) was simulated using MesoRD. For the wild-type-shaped
cell, the stochastic and deterministic simulations are in good agreement, as
shown in figure 9.9.


Figure 9.9 Stochastic simulation of the MinD oscillations. The figure shows MinDE
complexes (black) and MinD (gray) bound to the bacterial membrane at 5-second
intervals. The 4 µm cell is modeled as a cylinder with two spherical caps and is divided
into a membrane and an intracellular compartment. The three-dimensional volume is
discretized into subvolumes with side length 0.05 µm. The concentrations of molecules and
other parameters are close to those in the corresponding deterministic model (Huang et al.,
2003).

Summary


We have presented in this chapter approaches for describing the dynamics of biological
systems when spatial inhomogeneities cannot be neglected and the transport
of particles has to be taken into account. In the simplest case, the transport of
the molecules constituting the system is diffusive. If each molecule has many po- 
tential reaction partners, a mean-field description in terms of reaction-diffusion 
equations is then possible. An analysis of the dynamic equations can in this case 
make use of the powerful tools of differential calculus and often starts with a linear 
stability analysis of stationary homogeneous states of the system. This analysis is 
a systematic and not very time-consuming way to get a first impression of how 
the system behaves for different values of the parameters. Further analysis then 
commonly involves a numerical integration of the dynamic equations. If the change
in local concentrations cannot be approximated as an average over a large number
of random events in each volume element, a stochastic description must be used,
for example, in terms of the reaction-diffusion master equation. This is a generalization
of the chemical master equation presented in chapter 8 and is built upon
a division of space into subvolumes. Monte Carlo methods have been developed
for its analysis. An extensive analysis of the system for many parameter values is
often not possible because of the long simulation times needed to get good statistics.
Instead, the continuum limit can first be used to identify potentially interesting
parameter values, for which an extensive stochastic analysis is then performed.

In general, transport can rely on mechanisms other than diffusion. The evolution
of particle densities is then still given by the continuity equation (9.15), but
the currents will differ from the diffusion current. Macroscopic expressions for the 
current are constrained by the symmetries of the system. If the system is close to 
thermodynamic equilibrium, the currents can be expressed as linear combinations 
of the generalized thermodynamic forces. In general, however, there is no systematic 
procedure to arrive at macroscopic expressions, even though symmetries can guide 
their development. If a microscopic model is available, then macroscopic expressions 
can be obtained by a procedure similar to the one presented in section 9.2.1. 

The techniques presented in this chapter have proven extremely useful to describe 
spatiotemporal structures in tissues and organisms. They are now also used to de- 
scribe the dynamics of subcellular structures, as was illustrated by the example of 
the Min-system in E. coli. In general, they can be advantageously used whenever
a rather limited number of different molecules is sufficient to characterize the state 
of a system. This does not imply that the number of different molecules constitut- 
ing the system must be small. For example, the cytoskeleton contains numerous 
different proteins. Its state can, however, often be sufficiently well characterized 
by the distribution of the cytoskeletal filaments, while the effects of the associated 
proteins can be lumped into a number of parameters characterizing the interactions 
between the filaments. Thereby, the methods presented here complement those used 
for the analysis of biochemical networks presented in chapter 4 and chapter 5. It 
can be expected that they will continue to prove very valuable in discovering gen- 
eral principles underlying the formation of spatiotemporal structures in biological 
systems. 

Notes


1. Therefore, continuum theories are often referred to as field theories. 

2. Another possibility is to use integro-differential equations. These contain integrals
over the fields. Integrals over space reflect non-local interactions, for example by 
neurons that connect to distant neurons, while integrals over time reflect a memory 
in the system. 

3. Note that the solutions to the dynamic equations do not necessarily display 
the same symmetries as the equations. For example, the dynamic equations for the 
Min-system in three spatial dimensions do not change when the system is rotated 
around the long axis of the bacterium. In contrast, the fields do not have to be 
invariant under this transformation. 

4. Symbols printed in boldface denote vectors. 

5. In addition to particle numbers, other conserved quantities are momentum 
and energy. Momentum conservation needs to be taken into account if forces 
act on the system or are created within the system. The source terms in the 
continuity equation reflect in this case external forces. For systems operating at 
constant temperature, the continuity equation for energy does not lead to additional 
independent equations. 

6. The diffusion limited association rate constant for two spherical reactants
freely diffusing in three dimensions is given by $k_a = \frac{4\pi D\rho\,k}{4\pi D\rho + k}$ (Noyes, 1961), where $D$
is the sum of the molecules' diffusion constants, $\rho$ is the reaction radius, and $k$
is the association rate constant at the reaction boundary. When $k \gg 4\pi D\rho$, the
reaction is strictly diffusion controlled and $k_a = 4\pi D\rho$ (von Smoluchowski, 1917).
The dissociation rate constant $k_d$ is similarly diffusion controlled, $k_d = \frac{4\pi D\rho\,\lambda}{4\pi D\rho + k}$
(Berg, 1978b), whereas the equilibrium constant $K_d = \frac{k_d}{k_a} = \frac{\lambda}{k}$ is independent of
the rate of diffusion. Here, $\lambda$ is the microscopic dissociation rate constant.
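
A short numerical example may help with the units in note 6. The Python sketch below assumes the forms $k_a = 4\pi D\rho\,k/(4\pi D\rho + k)$ and $k_d = 4\pi D\rho\,\lambda/(4\pi D\rho + k)$, and converts $4\pi D\rho$ from cm$^3$ s$^{-1}$ per molecule pair to M$^{-1}$ s$^{-1}$. The function names and the input values ($D = 10^{-7}$ cm$^2$ s$^{-1}$, $\rho = 5$ nm) are hypothetical and chosen only for illustration.

```python
import math

AVOGADRO = 6.02214076e23  # particles per mole

def diffusion_limit(D_cm2_s, rho_cm):
    """Smoluchowski limit 4*pi*D*rho, converted from cm^3/s per pair to M^-1 s^-1."""
    per_pair_cm3_s = 4.0 * math.pi * D_cm2_s * rho_cm
    return per_pair_cm3_s * AVOGADRO / 1000.0  # 1 liter = 1000 cm^3

def association_rate(k, k_diff):
    """k_a = k_diff * k / (k_diff + k), all in M^-1 s^-1."""
    return k_diff * k / (k_diff + k)

def dissociation_rate(lam, k, k_diff):
    """k_d = k_diff * lam / (k_diff + k); lam in s^-1, k and k_diff in M^-1 s^-1."""
    return k_diff * lam / (k_diff + k)

if __name__ == "__main__":
    k_diff = diffusion_limit(D_cm2_s=1e-7, rho_cm=5e-7)  # hypothetical inputs
    k, lam = 1e9, 10.0                                   # hypothetical microscopic rates
    k_a = association_rate(k, k_diff)
    k_d = dissociation_rate(lam, k, k_diff)
    print("diffusion limit (M^-1 s^-1):", k_diff)
    print("k_a:", k_a, "k_d:", k_d, "K_d = k_d/k_a:", k_d / k_a)
```

Changing `D_cm2_s` changes `k_a` and `k_d` by the same factor, so their ratio stays at `lam / k`, which is the independence from the rate of diffusion stated in the note.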


Biological Data Acquisition for System Level
Modeling—An Exercise in the Art of 
Compromise 


Zoltan Szallasi 


Most of the actual modeling of biological systems will be performed by researchers 
with strong foundations in the quantitative sciences. One of the most significant 
adjustments these experts have to make when entering the field of modeling of 
cellular systems is understanding and accepting the limitations of biological data. 
The system to be modeled, in most cases the living cell, is extremely complex, has 
rather limited observability and may be governed by principles that are beyond our 
current understanding. Most relevant to this book is the fact that we are trying to 
produce predictions for an entire system while only a subset of the variables can be 
measured, often with rather limited accuracy. It will probably take several years of 
intensive research to estimate the constraining effect of measurement techniques on 
system level modeling. This chapter reviews the various biological data acquisition 
techniques and compares their capabilities to the data requirements of various 
modeling techniques and to the estimated complexity of intracellular regulatory 
networks. 





10.1 Chapter Overview 


Most chapters in this volume are dedicated to the theoretical foundations and practical
realization of complex system modeling. To a significant extent these considerations
are independent of the fact that our intention is to model biological
systems. The general rules or limitations of ODE-based models (see chapter 6) are
basically the same when modeling the cell cycle or weather patterns. In fact, it
is an exciting and open question whether biological networks display any specific
systemic property that would sharply set them apart from other complex networks
(Milo et al., 2004b, 2002). It is currently rather certain, however, that general
properties of biological networks, such as robustness, and the limited accuracy
with which parameters can be measured will restrict the models in ways that
affect both the theoretical and practical aspects of biological modeling. In
terms of theory, the remarkable robustness of biological networks may constrain
complex networks of ODEs in such a way that their predictive power
enables the meaningful modeling of large subcellular networks (Stelling et al.,
2004b). In terms of practical limitations, insufficient accuracy or coverage of
biological parameter estimates may prevent models from producing any meaningful
predictions. Therefore, it is important for the prospective modeler to become familiar
with several relevant aspects of biological data acquisition, including:

A) The overall size and complexity of intracellular networks: This includes 
estimating the overall size of the genome in terms of active biochemical units, the 
number of relevant biochemical derivatives per gene and the average connectivity 
of the network. It must also be emphasized that deciphering the active part of
the genome, especially for higher organisms, is far from complete, and this
dynamic research field has yielded several major surprises during the last couple of
years, leading to a significant reevaluation of our understanding of how the genome
is organized in functional terms.

B) The general principles of biological measurements — their technical and con- 
ceptual limitations: The various modeling approaches rely on rather different data 
requirements. Therefore, clear estimates on the accuracy, coverage, and sensitivity 
of a given data acquisition technology will determine its suitability for a given com- 
putational task. Graph theoretic approaches or flux balance analysis, for example, 
usually involve a significant part of the entire intracellular network without a need
for estimating kinetic parameters or concentrations. Ordinary differential equation— 
based dynamic modeling, on the other hand, is usually performed on rather limited 
subnetworks, with its success highly dependent on accurately estimating kinetic pa- 
rameters and concentrations. The very crux of data acquisition in systems biology 
is the current trade-off between coverage and accuracy. 

C) Concentration measurements versus kinetic parameter measurements: Although
the detection technology is the same for both types of measurements, estimating
kinetic parameters relies on time-dependent measurements. Kinetic parameters are
also highly dependent on the experimental environment, for example, on whether the
measurement was performed in free solution or directly in the intracellular environment.
Consequently, determining kinetic parameters that reflect the intracellular
reality will take more specialized approaches than those applied for concentration
measurements.

D) The actual target of the measurements: Averaging biological measurements 
across cell populations, as most currently applied methods require, will mask 
important aspects of regulatory interactions in the individual cells. Therefore, the 
prospective modeler should be aware of situations when single cell measurements 
provide more relevant data. 


10.2 The Estimated Size and Complexity of Intracellular Regulatory Networks


As we will see in chapter 15, modeling large dynamic networks leads to formidable 
computational challenges. Therefore, it is desirable to start with the least complex 
network, or with the smallest subset of a large network that provides correct 
predictions or helps answer a set of specific questions. However, it is not known how 
large a segment of the entire network has to be modeled in order to be able to predict 
a certain cellular behavior. This is also a context-dependent problem: a dynamic cell 
cycle model of limited complexity may provide a good description of the behavior of 
normal cells but may fail to provide a meaningful description of the neoplastic cell 
cycle, which may involve several regulatory interactions that can be ignored in the 
normal cell. For example, the BCR-ABL fusion protein is never present in normal
cells; it is created by a chromosomal translocation, a hallmark mechanism of
cancer cells. This abnormal protein, which is not part of the mechanistic description
of the normal cell cycle, has a significant regulatory input on some parts of the cell
cycle machinery in leukemia (Gesbert et al., 2000). A modular view of biology has
been proposed to alleviate some of the computational problems associated with
system level modeling (Hartwell et al., 1999; Stelling et al., 2004b). However, the
question of whether biological networks have a modular structure is far from resolved (see
chapter 3). Therefore, it is worth providing some quantitative estimates of the size
and complexity of the entire intracellular regulatory network.


10.2.1 The Inventory of Biochemical Entities in an Intracellular 
Regulatory Network 


A reasonable starting point for estimating the size of intracellular networks is 
the number of active genes in a given cell. This can be fairly well estimated for 
prokaryotes. The genome of these relatively simple unicellular organisms contains 
from 500 up to 6,000-7,000 genes, which are rather tightly packed and free of 
the complexities observed in higher organisms, such as introns or splice variants 
(Brown, 2002). The vast majority of transcribed genes code for proteins; therefore,
the number of potentially relevant network variables can be relatively safely deduced
from the number of genes for these organisms.

Estimating the total size of the intracellular network of higher organisms is a 
significantly more difficult task, especially in the light of several recent unexpected 
discoveries. The first pass analysis of the recently finished genomes focused on 
the identification of protein coding regions and other widely studied nonprotein 
coding genes such as ribosomal RNA or micro RNAs. This yielded an estimate
for the total number of genes between 6,000 for yeast and about 20,000-25,000
genes for humans (Brown, 2002), although this latter number is still changing
considerably. It was first adjusted downward from a much higher number
(International Human Genome Sequencing Consortium, 2004); then new
experimental evidence suggested that several thousand genes had been missed by earlier
analysis (Saha et al., 2002). Despite these uncertainties, it is generally accepted that
for most organisms the number of protein coding genes will not exceed 30,000-35,000
(Johnson et al., 2005). However, three surprising lines of evidence suggest
that protein coding genes may not be the full story and that we might be considerably
underestimating both the total number of genes and the active part of the genome.
First, a significant portion of all genes seems to be transcribed in the antisense 
direction, in addition to the sense transcription, both in prokaryotes and eukaryotes. 
In human cells, for example, up to 20% of genes may be transcribed in the antisense 
direction as well (Lehner et al., 2002; Yelin et al., 2003). Second, splice variants will 
considerably increase the overall diversity of the transcriptome of most eukaryotes, 
especially higher organisms. It is estimated that at least half of the human genes 
are alternatively spliced and a single gene may have a large number of potential 
splice variants (Modrek et al., 2001). Third, a recent set of papers employed the 
so-called “tiling” microarray technology, in which large regions of entire genomes are 
expression profiled using oligonucleotide microarray probes that cover the genome 
at regular, closely placed intervals. These probes are designed in an unbiased fashion 
and cover intronic and intergenic regions of the genome in addition to the usually 
examined exonic regions. Surprisingly, a large number of nonexonic probes showed 
significant expression levels, suggesting the existence of a large number of thus far 
unidentified RNA species (Johnson et al., 2005). It remains to be seen whether 
these regions of the genome code for proteins or regulatory RNA. Nevertheless, the
recent, revolutionary impact of short regulatory RNA on biological research should
serve as ample warning that we should be prepared for further surprises (Bartel,
2004).

Actively transcribed distinct RNA sequences comprise only the first layer of 
complexity of intracellular biological networks. RNA strands associated with protein 
coding genes are transported to the ribosomes where they serve as templates 
for protein production. This, however, is only the starting point for a series 
of posttranslational modifications that are necessary for proteins to exert their 
respective effects. It should be noted that a whole series of regulatory events exists 
between the transcription of a certain gene and the various active protein derivatives 
of the same gene, and these regulatory events often receive multiple conditional 
inputs from an array of other elements in the network. Therefore, a given protein-coding
gene may have several biochemical derivatives, which may require separate
introduction into a given model (Hoffmann et al., 2002; Schoeberl et al., 2002). A
demonstrative example is shown in Figure 10.1 for the steps leading from
the production of the mRNA of a transcription factor to the production of the mRNA
of a downstream-regulated gene.

Figure 10.1 Independently regulated derivatives of the relA gene. (For details see text.)
The black arrows indicate independent regulatory inputs.

The protein product of the gene “relA” is part of the NF-κB transcription factor
complex, either with another identical RELA molecule as a homodimer, or with
one of several other proteins as a heterodimer (Karin et al., 2002) (lowercase letters
usually designate RNA whereas capital letters are used for the protein product of
the same gene). We start at the state when the mRNA of relA is already produced.
In addition to the transcriptional regulation, the level of mRNA of this gene can
be regulated by the stabilization or destabilization of mRNA. The level of protein
production will be proportional to the net amount of relA mRNA and not only
to the transcriptional activation of this gene. mRNA is the first regulated
derivative of the relA gene. All proteins are produced in the cytoplasm in a non-modified
form, and the RELA protein first has to be translocated to the nucleus
to exert its transcriptional activity. The non-phosphorylated cytoplasmic and
non-phosphorylated nuclear RELA protein, therefore, can be considered as two further
derivatives of the relA gene, since the IκB proteins will regulate the localization of
the NF-κB complex in a conditional manner (Karin et al., 2002). The activity of
the nuclear RELA protein is further regulated by phosphorylation at various serine
residues (Duran et al., 2003). Therefore, the nuclear phosphorylated form of RELA
can be considered as an additional derivative, since both the function and the regulation
by stabilization differ for the phosphorylated and non-phosphorylated forms.
As shown in Figure 10.1, the gene relA has at least four independently regulated
derivatives: its mRNA, the non-phosphorylated cytoplasmic, non-phosphorylated
nuclear, and the phosphorylated nuclear forms. When building a dynamic model,
these derivatives have to be entered into the model as separate entities (Hoffmann
et al., 2002).
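
To illustrate what entering the derivatives as separate entities means in practice, the following Python sketch treats the four relA derivatives as four state variables of a small ODE system. It is a toy model under stated assumptions and is not the model of Hoffmann et al. (2002): the linear rate laws, the parameter names, and all numerical values are hypothetical, and the equations are integrated with a simple explicit Euler scheme.

```python
# State: (mRNA, cytoplasmic RELA, nuclear RELA, phosphorylated nuclear RELA)
PARAMS = {  # hypothetical rate constants in arbitrary units
    "k_tx": 1.0,   # transcription
    "d_m": 0.5,    # mRNA degradation (stabilization/destabilization acts here)
    "k_tl": 2.0,   # translation
    "k_in": 1.0,   # nuclear import
    "k_out": 0.2,  # nuclear export
    "d_c": 0.1,    # degradation of the cytoplasmic form
    "k_ph": 0.5,   # phosphorylation in the nucleus
    "k_dp": 0.3,   # dephosphorylation
    "d_p": 0.2,    # degradation of the phosphorylated form
}

def rela_rhs(state, p):
    """Time derivatives of the four derivatives, each with its own regulation."""
    m, c, n, ph = state
    dm = p["k_tx"] - p["d_m"] * m
    dc = p["k_tl"] * m - (p["k_in"] + p["d_c"]) * c + p["k_out"] * n
    dn = p["k_in"] * c - (p["k_out"] + p["k_ph"]) * n + p["k_dp"] * ph
    dph = p["k_ph"] * n - (p["k_dp"] + p["d_p"]) * ph
    return (dm, dc, dn, dph)

def integrate(state, p, dt=0.01, t_end=200.0):
    """Explicit Euler integration from t = 0 to t_end."""
    for _ in range(int(round(t_end / dt))):
        rates = rela_rhs(state, p)
        state = tuple(x + dt * r for x, r in zip(state, rates))
    return state

if __name__ == "__main__":
    m, c, n, ph = integrate((0.0, 0.0, 0.0, 0.0), PARAMS)
    print("mRNA:", m, "cytoplasmic:", c, "nuclear:", n, "phosphorylated:", ph)
```

Each regulatory input in Figure 10.1 would act on a different parameter of such a model, which is why lumping the four forms into a single variable would discard independently regulated steps.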

Therefore, the second step in determining the overall size of intracellular regula- 
tory networks is estimating the number of relevant posttranslational modifications 
per protein. Various estimates put the number of distinct posttranslational forms of
a given protein at about 3 in yeast and at 3-6 in humans (Banks et al., 2000; Papin
et al., 2005). For prokaryotes, this number seems to be less than two. These are obviously
rough and preliminary estimates, and high quality, manually curated protein
databases will certainly provide more reliable numbers in the future (O’Donovan 
et al., 2001; Peri et al., 2003). 

These numbers will be further increased by the fact that the same protein
may have to be accounted for separately in each of its relevant localizations, as
certainly seems to be the case for membrane associated receptors or nuclear proteins
(Schoeberl et al., 2002; Smith et al., 2002).

The above-described rather staggering complexity seems to dwarf the more 
moderately sized collection of small molecules in a cell, which is commonly referred 
to as the metabolome. The total number of small molecules in any given organism, 
including humans, will probably not exceed 2,000-2,500 (Kell, 2004). 

Taken together, based on the collection of genes and their derivatives, it seems 
that the number of independently regulated biochemical species will be between a 
few thousand for the simplest organisms and several hundreds of thousands for 
more complex organisms, such as humans. These numbers will probably elicit 
a wide variety of responses in the newcomers to the field, varying from total 
hopelessness to cautious optimism. On the one hand, even a relatively small network 
of ordinary differential equations can get out of hand rapidly (see chapter 6). 
On the other hand, the various constrained models, such as flux balance analysis 
(see chapter 5), provide meaningful predictions about biological systems based on 
networks of approximately a thousand metabolites (Edwards et al., 2001a). Control 
theoreticians also like to point out that a Boeing 777 contains about 150,000 
subsystem modules, significantly more than the number of “relevant parts” that 
seem to be in a simple bacterium (Csete and Doyle, 2002). It is, however, far
from certain whether this is a fair comparison. The effect of stochasticity in biology
(see chapter 8) and the implications of human, control-theory-based design (see
chapter 12) need to be accounted for and understood before a modeler gets carried
away by such an optimistic comparison.
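
The order of magnitude of these species counts can be checked with a back-of-envelope calculation. The Python sketch below multiplies the gene numbers quoted in this section by the quoted numbers of posttranslational forms per protein; it is a rough illustration under the assumption that the two factors simply multiply, and the function and dictionary names are hypothetical. Splice variants, antisense transcripts, and separate localizations are not counted and would raise the upper bounds further.

```python
def species_range(genes_low, genes_high, forms_low, forms_high):
    """Lower and upper bound on species count: genes times forms per gene."""
    return genes_low * forms_low, genes_high * forms_high

# (genes_low, genes_high, forms_low, forms_high), taken from the ranges in the text
ESTIMATES = {
    "prokaryote": (500, 7_000, 1, 2),
    "yeast": (6_000, 6_000, 3, 3),
    "human": (20_000, 35_000, 3, 6),
}

if __name__ == "__main__":
    for organism, bounds in ESTIMATES.items():
        low, high = species_range(*bounds)
        print(f"{organism}: {low:,} to {high:,} species")
```

For humans this simple product already reaches about two hundred thousand species at the upper end, consistent with the estimate of several hundred thousand once the uncounted sources of diversity are included.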

It may also be informative to take a look at the total number and the dynamic 
range of macromolecules per cell. From the size and dry weight content of cells and 
the average size of proteins or RNA, one may easily arrive at the following estimates 
(see for example www.dur.ac.uk/biological.sciences/Staff/Croy/GENNET1.HTM). 
A large cell, such as a hepatocyte (liver cell), is estimated to contain about $8 \times 10^{9}$
protein molecules. This number is distributed among 10,000 different types of
proteins with a dynamic range of about 5 orders of magnitude.
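
The kind of estimate described here, from cell size, protein content, and average protein size, can be sketched in a few lines of Python. All three input values below (a cell volume of 3.4 picoliters, a protein content of 200 g per liter, and a mean protein mass of 50 kDa) are round-number assumptions made for illustration and are not taken from the cited source; the function name is likewise hypothetical.

```python
AVOGADRO = 6.02214076e23  # particles per mole

def protein_copies(cell_volume_l, protein_g_per_l, mean_protein_g_per_mol):
    """Number of protein molecules = (mass of protein / molar mass) * Avogadro."""
    grams = cell_volume_l * protein_g_per_l
    return grams / mean_protein_g_per_mol * AVOGADRO

if __name__ == "__main__":
    total = protein_copies(
        cell_volume_l=3.4e-12,          # assumed volume of a large cell
        protein_g_per_l=200.0,          # assumed protein mass concentration
        mean_protein_g_per_mol=5.0e4,   # assumed mean protein mass (50 kDa)
    )
    print("protein molecules per cell:", f"{total:.2e}")
    print("mean copies per protein type:", f"{total / 10_000:.2e}")
```

With these assumed inputs the result is of the order of several billion molecules per cell. Dividing by 10,000 protein types gives a mean of under a million copies per type, around which individual proteins spread over several orders of magnitude.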


10.2.2 The Inventory of Regulatory Interactions in Intracellular 
Networks 


In addition to the “number of parts” (nodes in graph representations, see chapter 7),
the number of regulatory interactions (edges in a graph) is also an important
characteristic of intracellular regulatory networks. These interactions can be derived
by a wide variety of large-scale measurement techniques that detect various types of
regulatory interactions. In principle, any method that can produce a measurable
phenotype for a large enough number of genes can be (and has been) used. Some
types of interactions are directly probed by an actual measurement and readily
interpretable in biochemical terms, such as protein binding. Others are derived
indirectly by computational analysis of other data sets. For example, regulatory
interactions have been postulated for genes that are coregulated in a large number
of perturbed microarray based gene expression profiles (Ihmels et al., 2002; Tirosh
and Barkai, 2005), or between genes that show synthetic lethality in double genetic
knock-outs (Tong et al., 2004). In these latter cases it is often difficult to directly
identify the actual biochemical mechanism(s) behind the postulated regulatory
interactions.

Thanks to the accessibility of appropriate data, large protein interaction networks 
have been extensively studied, and cross-validating the various data sets has led to 
reliable estimates on the number of these interactions. Experimental data were 
produced both by small-scale protein interaction assays deposited in databases
such as the Database of Interacting Proteins (DIP) and the Human Protein Reference
Database (Peri et al., 2003) and by high throughput technologies such as yeast
two-hybrid measurements and high throughput mass spectrometry (see, for
example, Lee et al., 2004). For yeast, the most recent estimates suggest on the order of
30,000-35,000 interactions for the entire genome, yielding roughly 6-7 interactions
per protein.

Human data sets are currently more sparse and biased (remember we do not even 
know the total number of human protein coding genes, let alone their identity). It 
may still be informative that the existing, supervised data sets currently contain 
3-4 interactions per protein (Peri et al., 2003). 

In summary, intracellular regulatory networks for the various organisms can
probably be visualized as graphs with a total node number between a few thousand
and perhaps a few hundred thousand, and with an average connectivity of less
than 10.





10.3 Classifying Measurement Techniques from a Computational Modeling 
Perspective 


The purpose of the following classification is to provide guidelines for prospective 
modelers when looking for or intending to produce appropriate data sets for a given 
modeling problem. An appropriate classification will reflect both the needs of a given 
modeling approach and the capabilities of the various measurement techniques (see 
Figure 10.2). 


10.3.1 The Target of the Measurement 


Section 10.2 is essentially a “list of parts” of intracellular networks. The most nu- 
merous group of relevant quantifiable variables comprises, of course, the genes and 
their derivatives. They can be measured at the DNA level, RNA level, and the 
various levels of protein modifications along with their localization. Metabolic net- 
work analysis requires the quantification of metabolites. The recently popularized 
suffixes -omics or -ome describe the measurement or cataloging of an entire collection
of one type of biochemical molecule. The “genome” refers to the entire genetic
information stored in the DNA strands of a given organism, the “transcriptome” comprises
all the genetic information that gets transcribed into RNA, and so on. From
a practical point of view, it is worth noting that all members of a given “-ome”
can usually be measured by the same type of technology, whereas measurement
technologies usually transfer poorly between the various “-omes.” For example, a
given oligonucleotide-based microarray platform can be, at least in principle, used
for the measurement of any RNA species. On the other hand, this technology is
not suitable for the measurement of, for example, phosphoproteins. Merging data
sets across various “-omes” and across various technologies is not an obvious task 
(Luscombe et al., 2004) and usually requires well thought-out, specialized methods 
such as Bayesian approaches (see chapter 4) (Lee et al., 2004). 


10.3.2 Concentration versus Interaction Measurements 


Concentration measurements can be performed accurately in a “context indepen- 
dent” manner. After destroying the cell, a necessary preparative step in many 
cases, the number of molecules can still be counted accurately by any of the well- 
established measurement technologies as outlined below. Measuring molecular in- 
teractions, however, is dependent on the “cellular context.” An interaction detected 
in free solution may never occur inside the cell. Therefore, measurement techniques 
that reflect the reality of the inside of the cell had to be developed. 


10.3.3 The Information Content of Measurements 


In order to perform a given computational modeling task, a certain amount of 
experimental information is needed. For example, one can estimate the amount 
of data that is necessary to reverse engineer a given regulatory network (see 
chapter 11) (Andrec et al., 2005; Sontag et al., 2004). The accuracy and sensitivity of 
measurement techniques along with the strategy of selecting appropriate conditions 
or time points of the samples to be quantified all have a profound impact on the 
useful information content of an experiment (Szallasi, 1999). Therefore, one might 
be able to estimate whether a given experimental technique is suitable to produce 
appropriate data for a modeling approach. Some of the experimental techniques 
are mainly able to identify biochemical molecules without the power of providing 
anything more than semi-quantitative concentration estimates. Mass
spectrometry-based proteomics without isotope labeling is an example that is discussed later 
on. However, these measurements, even without a quantitative dimension, can 
be used for topological understanding and modeling of protein networks (Lee 
et al., 2004). Other measurements, such as gene expression microarrays, provide 
information about the direction and magnitude of gene expression changes. Whereas 
the direction of changes seems to be rather reliable, the ratio of changes seems to be 
compressed in an intensity dependent manner (Yuen et al., 2002). Finally, several 
low throughput techniques, such as Western blot analysis (see below), provide 
accurate measurements with an error rate well within 10-15%. 


Data characteristics           Measurement technology         Modeling approach

Absent/present                 Mass spectrometry analysis     Graph models
                               of protein interaction
                               clusters

Reliable detection of          Microarray measurements        Probabilistic/
expression changes with        of gene expression profiles    Qualitative models
limited precision of
the actual expression
ratios

Concentration                  Western blot analysis of       Detailed, ODE-based
measurements with an           protein concentrations         dynamic models
experimental error
of 10-20%


Figure 10.2 Different biological data acquisition technologies produce results with 
rather different measurement accuracy. The precision of the method will determine its 
potential utility for a given modeling approach. 





10.4 Low Throughput, Accurate Measurements of Gene Derivative Concentrations 


The history of biochemical measurements has been a long struggle for increased 
specificity and sensitivity. In order to describe a biological system at the quantitative 
level, one would like to measure the various biochemical derivatives, such as the 
various posttranslational modifications of the relevant genes at various localizations 
within the cell, preferably with a reasonable time resolution. Needless to say, this 
is hard to achieve, and current methods involve various levels of trade-offs between 
the number of biochemical species to be quantified and accuracy. 

Measuring the concentration of a single, well-defined biochemical species in a solu- 
tion is well within the capabilities of modern molecular biology. Most high precision 
methods are based on the combination of size separation and the application of a 
specific, high affinity reporter system. Size separation methods are usually based 
either on gel or capillary electrophoresis, where the macromolecules of various sizes 
are driven through a molecular sieve by electrostatic potential. The sieve is formed 
of polymers such as polyacrylamide or agarose, that are cross-linked to produce the 
appropriate pore size that is best suited to separate the molecular weight range of 
the molecules of interest. Gel electrophoresis produces only a rather limited size 
resolution; therefore, high specificity reporters are needed for accurate identification 
and quantification. As outlined later, for certain types of biopolymers, in particular 
nucleic acid chains, highly specific reporters can be produced with relative ease, 
whereas for others, such as polypeptide chains, the availability of specific reporters 
(antibodies) still depends on processes that cannot be easily controlled. 

DNA or RNA fragments are first size separated by agarose gels during Southern 
and Northern blot analysis, respectively. The specificity of reporters is based 
on the Watson-Crick pairing. Under appropriate experimental conditions, a long 
enough nucleotide sequence with limited sequence homology to other parts of the 
genome will hybridize only to its target sequence. The probes are labeled by the 
incorporation of radioactive or fluorescently labeled nucleotides, which, in turn, will 
produce readily measurable signals that could be used for quantifying the DNA or 
RNA fragment in question. The experimental error is usually below or around 
10-20% in the hands of an experienced user, and detection limits, thanks to new 
technology such as quantum dots, are in the sub-femtomolar range (Liang et al., 
2005). This should allow the quantification of RNA molecules that are expressed 
at the level of 1 molecule per cell. This is, in fact, a necessary level of sensitivity 
because a significant number of transcripts are expressed at this level (Holland, 
2002). 

Quantitative RT-PCR also exploits the specificity of Watson-Crick pairing. In 
this case, two specific, most often fluorescently labeled PCR primers are used that 
will initiate the amplification of only the target nucleotide sequence. This process is 
kinetically measured and can be reliably used to estimate the starting concentration 
of the target sequence. It has a similar accuracy to Northern blots, and it was used to 
measure the concentration of RNA species below the concentration of one molecule 
per cell (Holland, 2002). 
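
The kinetic readout of quantitative RT-PCR can be illustrated with a short sketch. It assumes the idealized exponential model N_c = N_0 (1 + E)^c, where E is the per-cycle amplification efficiency (1.0 for perfect doubling) and the threshold cycle Ct is the cycle at which the signal crosses a fixed threshold. The function names and the sample values are hypothetical and are not taken from the cited studies.

```python
def starting_copies(threshold_copies, ct, efficiency=1.0):
    """Back-calculate N0 from the idealized model N_ct = N0 * (1 + E)**Ct."""
    return threshold_copies / (1.0 + efficiency) ** ct


def fold_difference(ct_a, ct_b, efficiency=1.0):
    """Ratio of starting amounts (sample A over sample B) for the same target."""
    return (1.0 + efficiency) ** (ct_b - ct_a)


# Hypothetical threshold cycles: sample A crosses the threshold 3 cycles earlier.
ratio = fold_difference(ct_a=22.0, ct_b=25.0)
print(f"A/B starting ratio: {ratio:.1f}")
```

With perfect doubling, a difference of three cycles corresponds to a 2^3 = 8-fold difference in starting concentration. In practice the efficiency is below 1 and has to be calibrated for each primer pair.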

Protein concentrations are routinely quantified by Western blot analysis. In 
this, a protein mix is first size separated by polyacrylamide gel electrophoresis 
(Laemmli, 1970), then the specific reporter system is applied. For proteins, no 
convenient method similar to the Watson-Crick pairing exists to produce highly 
specific probes. It took decades and a great deal of ingenuity to work out effective 
methods to produce antibodies for analytical purposes (Harlow and Lane, 1988). 
Quite remarkably, today’s antibodies provide highly specific probes not only for 
individual proteins but for the various posttranslational modifications, such as 
specific phosphorylation states, of a given protein as well (Czernik et al., 1991). 
Several ordinary differential equation-based modeling studies took advantage of 
the specificity and accuracy of data produced by such antibodies in Western 
blot analysis (Hoffmann et al., 2002; Schoeberl et al., 2002). The accuracy and 
sensitivity are similar to those of Northern blots. However, it should be noted that the 
production of specific antibodies against a large number of diverse proteins is still 
a significantly more labor-intensive experimental project than producing probes 
against nucleic acid sequences, which, to some extent, could be reduced to a 
computational problem. 

From a modeling perspective, it is important to note that the above-described 
low throughput methods usually produce reliable, specific and rather accurate 
measurements. Hence their popularity for parameter fitting in dynamic models 
(Hoffmann et al., 2002; Schoeberl et al., 2002). 

It should also be remembered, however, that in most cases the high specificity 
and accuracy of the above-described methods rely heavily on the actual analytical
conditions applied, which should be carefully adjusted for each probe separately.
Antibody and nucleotide probe concentrations, salt concentrations of the 
hybridization and washing buffers, and hybridization temperatures should all be 
carefully optimized for maximum specificity. These parameters will of course vary 
significantly from target to target, and this fact already forecasts the difficulties 
encountered with high throughput methods, such as microarrays, that are based on 
multiplexing the above-described methods. 

In principle, given high specificity probes, one can eliminate the size-separation 
step. In this, the so-called dot blot technique, one can immobilize either the probes 
or the sample mixture to solid support and then hope that there is only one 
biochemical entity binding to the probe (Maniatis et al., 1982). In traditional 
biochemistry, this approach was rather marginally applied — results always looked 
more convincing when supported by the correct size information. Interestingly, this 
method started a second, spectacularly successful life in the form of microarray 
technology. 





10.5 High Throughput Measurements and Low Accuracy—A Necessary 
Compromise? 


10.5.1 High Throughput Gene Expression Measurements 


In principle, given enough manpower and financial support, every biochemical 
measurement can be scaled up in a massively parallel fashion even to genomic scale. 
In fact, the current interest in system level modeling was started by the introduction 
of massively parallel measurement techniques, such as gene expression microarrays 
(Schena et al., 1995). These are essentially highly efficient multiplexed dot blots, 
enabled by microfabrication and automation. Varying and patenting essential 
experimental details led to the development of a large number of alternative 
microarray platforms (Hardiman, 2004). Nevertheless, microarray based RNA or 
DNA quantification methods are all based on the same basic principles. Nucleotide 
probes of varying lengths, from 25 base pairs up to hundreds of base pairs, are either 
immobilized or in situ synthesized on solid support. The RNA or DNA mixture to 
be quantified is labeled, most often by fluorescent dyes, and then hybridized to the 
microarray chips. An enormous amount of work, thousands of publications, went 
into working out both the experimental and computational analytical details of 
optimal microarray analysis. These efforts were often hindered by legal interference 
from manufacturers (Rouse and Hardiman, 2003) and the unavailability of essential 
information, such as microarray probe sequences, for the research community 
(Mecham et al., 2004). The inordinate number of relevant technical publications 
suggests a rather limited satisfaction with the accuracy of microarray technology. 
Here we can summarize only the most relevant concerns, and we have to refer to 
appropriate reviews for further details (Jordan, 2004; Draghici et al., 2006). 

Because independent verification is not cost efficient, the estimated accuracy of
microarray measurements is only sparsely supported by verification data. A typical microarray 
platform contains thousands or tens of thousands of probes, but most studies will 
verify the quantification provided by microarrays only for a much smaller number 
of genes, typically less than one hundred (see for example (Gold et al., 2004; Hol- 
land, 2002)). A few recent studies quantified gene expression levels by quantitative 
RT-PCR from several hundred to over a thousand genes and a couple of interinsti- 
tutional efforts are also underway to perform similar validation (Czechowski et al., 
2004; Holland, 2002). However, the overall level of accuracy of microarray
measurements is far from being established (Draghici et al., 2006). 

Cross-platform comparison of RNA aliquots was intended to compensate for the lack
of independent verification. It is, however, an imperfect tool with 
which to validate microarray platforms. Lack of consistency can be caused by the 
inferior performance of either one or both platforms, without clear indication of their 
relative merit. On the other hand, highly similar results across platforms could be 
simply caused by consistent cross-hybridization patterns without either platform 
measuring the true level of expression. Current experience in the field suggests 
that short oligo based microarray platforms show a rather good correlation, with 
a Pearson correlation coefficient of about 0.7 or better (Bammler et al., 2005; Woo 
et al., 2004; Yauk et al., 2004). cDNA microarray-based results, however, have a 
more limited correlation with short oligonucleotide-based platforms, around 0.5 on 
average (Bammler et al., 2005; Mecham et al., 2004; Woo et al., 2004). It must be 
noted that these correlations are always based on gene expression ratios between 
two different RNA samples. Despite some optimistic reports (Hekstra et al., 2003), 
absolute levels of gene expression can hardly be estimated using only microarray 
data. This problem is best exemplified when looking at microarray data produced 
by a platform, for example, the Affymetrix gene chip, that uses multiple probes 
against the same transcript. Probes that are producing signals by hybridizing 
to the same transcript may show orders of magnitude variations in their signal 
intensity (Draghici et al., 2006). Surface chemistry, significantly different free energy 
binding values between probes, cross hybridization, or the efficiency of labeled 
nucleotide incorporation probably each have an effect on the poorly understood 
correlation between signal intensity and target concentration. As a consequence, 
while gene expression ratios can be estimated with reasonable certainty for a 
significant number of genes, measuring absolute concentrations in a comprehensive 
fashion is currently beyond the capabilities of microarray technology. 
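
The distinction between ratios and absolute levels can be made concrete with a small sketch. It computes the Pearson correlation between two platforms on per-gene log2 expression ratios, the quantity on which the cross-platform comparisons above are based. All intensities are made-up numbers for two hypothetical platforms, and the helper names are illustrative only.

```python
import math


def log2_ratios(sample_a, sample_b):
    """Per-gene log2 expression ratio between two RNA samples."""
    return [math.log2(a / b) for a, b in zip(sample_a, sample_b)]


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y))
    sx = math.sqrt(sum((xi - mx) ** 2 for xi in x))
    sy = math.sqrt(sum((yi - my) ** 2 for yi in y))
    return cov / (sx * sy)


# Made-up signal intensities for five genes measured in samples A and B.
platform1_a = [1200.0, 300.0, 80.0, 5000.0, 640.0]
platform1_b = [600.0, 310.0, 160.0, 1250.0, 650.0]
platform2_a = [90.0, 2100.0, 15.0, 700.0, 45.0]
platform2_b = [50.0, 2000.0, 25.0, 230.0, 40.0]

r = pearson(log2_ratios(platform1_a, platform1_b),
            log2_ratios(platform2_a, platform2_b))
print(f"cross-platform correlation of log2 ratios: {r:.2f}")
```

In this toy data set the raw intensities of the same gene differ by more than an order of magnitude between the two platforms, yet the ratios agree closely. This mirrors the situation described above, in which ratios are comparable while absolute signals are not.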

Microarray analysis is based on a strong, although rarely discussed, assumption: 
most microarray probes on a given platform produce sufficiently specific signals
under a single, rather permissive hybridization condition (see Figure 10.3). Increasing 
evidence suggests that this might be true only for a subset of probes on any
microarray platform (Zhang et al., 2005). Consequently, while gene expression ratios 
might be estimated for some genes with an error of less than 20-30%, the error for 
other genes may be far greater than that. There is little, if any, guidance in the 
literature that would help with predicting the accuracy of a microarray measurement
for a given gene in a particular experimental setting. The general expectation 
is that two-fold changes in gene expression can be reliably measured across the 
board. The detection limit of current microarray technology is around 10 copies of 
mRNA per cell (Holland, 2002; Kane et al., 2000). It should also be noted that this 
level of sensitivity may be insufficient to detect relevant changes in low abundance 
genes, such as transcription factors (Holland, 2002). 


Figure 10.3 Massively parallel measurement technologies, such as gene expression
microarrays, are usually run under a single experimental condition, which is not ideal for 
most of the individual reactions. Each microarray probe can be associated with an ideal 
hybridization condition determined by the hybridization temperature, salt concentrations,
etcetera. This is expected to produce the most specific signal with minimal
cross-hybridization. The ideal hybridization temperatures (T_i) and salt concentrations ([Na+]_i) 
are usually probe-sequence dependent, producing different values for the two microarray 
probes highlighted in the figure. Nevertheless, gene chips are routinely run under a single 
set of conditions (T_chip, [Na+]_chip). A similar one-size-fits-all strategy is often implemented 
for economic reasons. 

A possible compromise between the relatively inaccurate microarray technology 
and low throughput, high precision methods is running a large number of quan- 
titative real-time PCR reactions in a parallel fashion using, for example, 384-well 
optical plates. A recent study quantified 1,400 genes in Arabidopsis with high re- 
producibility and high sensitivity (0.001 mRNA copies per cell) over a six-orders- 
of-magnitude dynamic range (Czechowski et al., 2004). It remains to be seen to 
what extent the increased accuracy and sensitivity may be able to offset the impact 
of the lower coverage and the higher initial labor cost of PCR primer design. 


10.5.2 High Throughput Protein Quantification 


Soon after the runaway success of gene chip technology, antibodies were arrayed 
on solid phase support in order to develop protein microarrays (Haab et al., 2001). 
For proteins, unfortunately, neither sample labeling nor probe preparation is as 
straightforward as in the case of nucleic acids. This resulted in a much slower 
development, during which researchers had to try several alternative detection methods 
(MacBeath, 2002). The detection limits of protein microarrays are perhaps similar to,
or not much worse than, those of gene chips (1 part in 1,000,000), especially 
when preceded by fractionation, in which the relative concentration of the protein 
of interest is increased at the expense of other proteins. For example, high-speed 
centrifugation can remove the abundant cytoskeletal, structural proteins, which, 
this way, will be prevented from interfering with the quantification of soluble 
proteins. A large enough body of experience has not yet been published
to provide a comprehensive estimate of the accuracy of antibody microarray 
measurements. A consistent and reliable detection of 2-fold changes would perhaps 
satisfy most current users. 

Detecting proteins with high specificity on protein microarrays depends on 
whether, with “luck,” an appropriate antibody can be developed for a given protein. 
This and other, detection-related, problems associated with protein microarrays 
were probably not lost on experts who were developing alternative technologies 
for high throughput quantification of protein mixes. By far the most powerful 
and most widely used of these methods is mass spectrometry-based proteomics. 
Identification of proteins is based on measuring the mass-to-charge ratio of ionized 
protein fragments, and their quantification is based on counting the numbers of a 
given ionized fragment reaching the detector. Mass spectrometry requires by far the 
most complex sample processing and instrumentation of all the methods discussed 
so far. First, protein mixes are usually fractionated in order to reduce the complexity 
of the sample. This is rather important, because abundant proteins may outnumber 
rare proteins by four to five orders of magnitude, thus obscuring the signals obtained 
from less abundant proteins. Then, the fractionated protein mix is subjected to 
tryptic digestion in order to produce smaller fragments that are later ionized. The 
peptide fragments are then separated by liquid chromatography. These fragments 
are ionized and then analyzed, usually by two mass spectrometry steps run in tandem. 
The second mass spectrometry step is run on fragments derived from a single
mass-to-charge peak from the first mass spectrometry step. For each of these steps 
a multitude of techniques exists, with their relative advantages and disadvantages 
reviewed by Aebersold and Mann (2003). Here we will review only those aspects 
of mass spectrometry that are relevant for system level modeling. 

Standard mass spectrometry is not appropriate for accurate quantifications with- 
out the further modifications discussed below. Qualitative modeling methods, such 
as graph theoretic interpretations of intracellular regulatory networks, can take
advantage even of these non-quantitative data. For example, a significant portion of 
the yeast “interactome” was mapped out recently by the mass spectrometric analysis 
of protein complexes that were isolated by several thousand various tagged “bait” 
proteins (Gavin et al., 2002; Ho et al., 2002). (In this case the tag is a short peptide 
sequence attached to the end of the native amino acid sequence that allows a high 
affinity separation of the bait protein and its interacting partners. While “protein- 
tagging” is a widely used and efficient technology it must be noted that the tag 
may influence the behavior of the tagged protein.) The mass spectrometry-based 
interactomes were combined with other types of interactome data sets yielding a 
probabilistic functional network of yeast, mapping out potential modules or clusters 
for further analysis (Lee et al., 2004). Although the principle of protein identifica- 
tion may sound deceptively simple, its associated difficulties become apparent upon 
closer inspection. Tryptic digestion fragments rarely have a unique mass-to-charge 
ratio (Alterovitz et al., 2006). Hence the need for a second mass spectrometry step 
on the fragment ions of a given peptide peak identified in the first step. However, 
the fragment ion spectra cannot be readily converted into peptide sequences solely 
based on theoretically expected distributions. Instead, the spectra generated during 
the second mass-spectrometry step are usually compared to comprehensive protein 
sequence databases. Therefore the success of protein sequencing will highly depend 
on the quality of reference databases. A large number of “machine learning”-inspired 
methods have been suggested to overcome this problem with varying success
(Alterovitz et al., 2006). Nevertheless, considering the speed of the development of mass 
spectrometry based proteomics, there is little doubt that for organisms with com- 
prehensive lists of sequence information, protein identification by this technology 
will be achieved within the foreseeable future. 
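
Why a single mass-to-charge value rarely identifies a tryptic peptide can be seen in a minimal in-silico sketch. It assumes the common textbook cleavage rule for trypsin (cut after K or R unless the next residue is P, no missed cleavages) and standard monoisotopic residue masses. The sequences are invented, and the function names are illustrative rather than part of any proteomics package.

```python
# Monoisotopic residue masses in daltons (standard values).
RESIDUE_MASS = {
    "G": 57.02146, "A": 71.03711, "S": 87.03203, "P": 97.05276,
    "V": 99.06841, "T": 101.04768, "C": 103.00919, "L": 113.08406,
    "I": 113.08406, "N": 114.04293, "D": 115.02694, "Q": 128.05858,
    "K": 128.09496, "E": 129.04259, "M": 131.04049, "H": 137.05891,
    "F": 147.06841, "R": 156.10111, "Y": 163.06333, "W": 186.07931,
}
WATER = 18.01056   # added once per peptide (free N- and C-termini)
PROTON = 1.00728   # mass of each charge-carrying proton


def tryptic_digest(sequence):
    """Cleave after K or R unless the next residue is P (no missed cleavages)."""
    peptides, start = [], 0
    for i, aa in enumerate(sequence):
        last = i == len(sequence) - 1
        if last or (aa in "KR" and sequence[i + 1] != "P"):
            peptides.append(sequence[start:i + 1])
            start = i + 1
    return peptides


def mz(peptide, charge=1):
    """Mass-to-charge ratio of a peptide carrying `charge` protons."""
    mass = sum(RESIDUE_MASS[aa] for aa in peptide) + WATER
    return (mass + charge * PROTON) / charge


print(tryptic_digest("MKRPGAKLR"))
# Peptides with the same composition, or with L swapped for I, share one m/z.
for pep in ("GASK", "AGSK", "LGSK", "IGSK"):
    print(pep, round(mz(pep), 4))
```

Peptides that differ only in residue order, or in leucine versus isoleucine, are indistinguishable at this stage, which is one reason the fragment-ion spectra of the second mass spectrometry step are needed.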

Quantifying ratios of protein concentrations by mass spectrometry involves an- 
other dimension of technical challenges and usually relies on stable isotope labeling. 
In this, one protein mixture (for example, cancer tissue) is isotope labeled by one 
of several appropriate methods (Aebersold and Mann, 2003), while the other sample
(for example, normal tissue) is left unchanged. The normal and cancer samples 
are then mixed. Since the chemical properties of a given peptide are still the same 
after isotope labeling, a mixture of the isotope labeled and native peptide can be 
co-analyzed by the mass analyzer. The difference in mass due to isotope incor- 
poration will yield two different mass-to-charge peaks and the difference in the 
area under those peaks will provide an estimate for the relative expression level of 
that protein in the two samples. A wide variety of ingenious isotope tagging
techniques have been developed, targeting various peptide side-chains such as sulfhydryl 
groups (Gygi et al., 1999), amino groups (Munchbach et al., 2000), etcetera. The ac- 
tual side-chain targeted and the protein purification techniques preceding the mass 
spectrometry will restrict the number of different proteins quantified in a given 
experiment (Gygi et al., 1999). 
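
The peak-area comparison described above can be sketched as follows. The spectrum is synthetic: two idealized Gaussian peaks stand in for the native (light) and isotope-labeled (heavy) forms of one peptide, and the m/z positions, peak width, and integration windows are arbitrary assumptions chosen only for illustration.

```python
import math


def gaussian_peak(mz_values, center, area, width=0.02):
    """Idealized Gaussian peak with the given integrated area."""
    norm = area / (width * math.sqrt(2.0 * math.pi))
    return [norm * math.exp(-0.5 * ((x - center) / width) ** 2) for x in mz_values]


def trapezoid_area(x, y, lo, hi):
    """Trapezoidal area under y(x) restricted to the window lo <= x <= hi."""
    total = 0.0
    for i in range(len(x) - 1):
        if x[i] >= lo and x[i + 1] <= hi:
            total += 0.5 * (y[i] + y[i + 1]) * (x[i + 1] - x[i])
    return total


# Synthetic spectrum: native peptide (normal tissue) at m/z 500.0 and its
# isotope-labeled partner (cancer tissue) shifted to m/z 504.0.
mz_axis = [499.0 + 0.001 * i for i in range(6001)]
light = gaussian_peak(mz_axis, center=500.0, area=100.0)
heavy = gaussian_peak(mz_axis, center=504.0, area=300.0)
spectrum = [a + b for a, b in zip(light, heavy)]

light_area = trapezoid_area(mz_axis, spectrum, 499.5, 500.5)
heavy_area = trapezoid_area(mz_axis, spectrum, 503.5, 504.5)
ratio = heavy_area / light_area
print(f"cancer/normal ratio estimated from peak areas: {ratio:.2f}")
```

Real spectra contain overlapping isotope envelopes, noise, and co-eluting peptides, so peak detection and integration are considerably more involved than this sketch suggests.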

The accuracy of these technologies usually allows the reliable identification of 
at least 1.5- to 2-fold changes, although this estimate probably applies only to 
more abundantly expressed proteins (Gygi et al., 1999). The actual sensitivity 
of this technology is highly dependent on the purification steps preceding the 
actual mass spectrometry analysis. In absolute terms, the detection limit of proteins 
isolated from polyacrylamide gel electrophoresis bands and then subjected to mass 
spectrometry is in the femtomole range, or on a weight basis, it is in the low 
nanogram range (1-5 ng). This would roughly translate into a detection limit of 1 
part per million. This ratio may reflect very different sensitivity levels in terms of 
copy number per cell depending on the pre-electrophoresis fractionation. 


10.5.3 Further Uses of Mass-spectrometry 


The universal principle underlying this technology offers itself to a wide variety 
of exploitations. In fact, given appropriate sample preparation methods and adequate
mass spectrometry databases (to which the mass-to-charge peaks can be 
compared), any biologically relevant molecule can be, at least in principle, identified 
and quantified. Whether mass spectrometry is applied to a given task is highly 
dependent on cost efficiency, ease of use, and other practical considerations. For 
example, although nucleic acids can be just as well analyzed by mass spectrometry 
as proteins (Jurinke et al., 2004), for the average end user microarray analysis offers 
a cheaper and easier alternative. While the mass spectrometric analysis of a single 
biopolymer (that is, a single band after gel purification) costs around one hundred 
dollars, a cheaper microarray platform may quantify several thousand genes for 
the same cost. When no such easily scalable alternative technique exists, mass 
spectrometry provides an excellent general tool for high throughput biochemical 
measurements. It has been used for the quantitative analysis of the metabolome 
(Allen et al., 2003), identification and to some extent the quantification of tyro- 
sine phosphorylation (Rush et al., 2005), protein ubiquitination (Peng et al., 2003), 
etcetera. These recently developed applications, however, currently belong to the 
realm of semi-quantitative methods. 





10.6 Detecting Regulatory Interactions and Quantifying Kinetic Parameters 


As we saw in section 10.4, concentration measurements can be rather accurately 
performed on a wide variety of intracellular molecules. Under certain experimental 
conditions this accuracy can be transferred to other types of measurements as well, 
since measuring kinetic parameters can be reduced to a time series of concentration
measurements, and detecting regulatory interactions can be reduced to 
a combination of appropriately modified concentration measurements. However, in 
many cases seemingly accurate measurements may provide misleading data about 
the actual intracellular conditions. 


10.6.1 Detecting Regulatory Interactions 


These measurements are usually based on simple modifications of several, above- 
discussed methods. One member of the interacting molecules is designated as the 
“bait,” which is used to separate the entire macromolecular complex. The success of 
the method depends on whether the bait molecule can be targeted with sufficient 
specificity. If yes, for example, by using an appropriate antibody, then the whole 
complex is isolated and its members are identified or quantified by standard meth- 
ods. Protein-protein or chromatin co-immunoprecipitation are such techniques. In 
the former, by using an appropriate antibody, a specific protein, such as a tran- 
scription factor, is isolated, and then other proteins interacting with it can be deter- 
mined. In the latter, the same antibody can be used to pull down the transcription 
factor and with it the DNA regulatory regions to which the transcription factor is 
binding. The nucleotide sequence of this regulatory region can then be determined. 
In order to preserve the intracellular regulatory interactions, chemical cross linkers 
are often applied. These molecules have two highly reactive moieties. Within a cer- 
tain distance, called “spacer arm length” (measured in Angstroms), these molecules 
tend to cross-link their target macromolecules by covalent binding. Therefore, when 
the cross-linkers are applied to the cell, interactions between proteins, for example, 
are fixed and carried over to subsequent analytic steps performed in free solution 
(Agou et al., 2004). 


10.6.2 Quantifying Kinetic Parameters 


Dynamic, for example ODE-based, models require accurate kinetic parameters that 
are usually determined by direct biochemical experimentation. In most published 
models these parameters are usually extracted from the literature. These models 
are usually based on individual gene derivatives and not, for example, on functional 
modules. Therefore, the kinetic parameters have to reflect the dynamic interaction 
between individual genes and proteins. For example, an ODE-based model of the 
epidermal growth factor (EGF) receptor pathway requires measuring the affinity 
between EGF and its receptor, the kinetic parameters of the tyrosine phosphoryla- 
tion of the EGF receptor, etcetera (Schoeberl et al., 2002). Most often these kinetic 
parameters are determined in experiments containing a more or less purified
population of the interacting proteins in free solution (that is, not measured inside the 
cell). Parameter optimization (Mendes and Kell, 1998) acknowledges the fact that 
the thus obtained kinetic parameters may often be misestimated relative to the true 
values reflecting intracellular conditions. In order to obtain more accurate estimates 
on these parameters, direct measurements on intracellular protein levels, without 
disrupting the cellular structure, were introduced. The actual tool, as is so often the 
case in biology, was provided by nature itself in the form of fluorescent proteins 
(Tsien, 1998). The first of these, the green fluorescent protein was isolated from a 
jellyfish that, for reasons not entirely clear, produces this highly efficient protein 
fluorophore. The emphasis here is on the fact that this is an ordinary protein in 
terms of production and processing. Therefore, when cloned and fused to a
protein in an experimental organism, the thus labeled protein starts to glow, emitting 
measurable signals allowing direct quantification. A remarkable diversity of fluores- 
cent proteins have been developed recently, emitting signals at various wavelengths 
(Hawley et al., 2001). It should be noted that fusing the rather sizable fluorescent 
proteins to the target may have profound effects on the production rate or function 
of the labeled protein. The ease and efficiency with which one can label proteins in 
prokaryotes has led to several interesting applications, such as reverse engineering 
of bacterial regulatory networks (Ronen et al., 2002). 
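
As a minimal illustration of how a kinetic parameter can be extracted from a time series of concentration measurements, the sketch below fits a first-order decay C(t) = C0 exp(-k t) by least squares on ln C. The model, the time points, and the roughly 10% multiplicative errors are assumptions made for this example; they do not come from the cited studies, and realistic pathway models require fitting many coupled parameters.

```python
import math


def fit_first_order(times, concentrations):
    """Least-squares line through (t, ln C); returns (k, c0) for C = c0*exp(-k*t)."""
    n = len(times)
    logs = [math.log(c) for c in concentrations]
    mt, ml = sum(times) / n, sum(logs) / n
    slope = (sum((t - mt) * (l - ml) for t, l in zip(times, logs))
             / sum((t - mt) ** 2 for t in times))
    intercept = ml - slope * mt
    return -slope, math.exp(intercept)


# Synthetic time course (minutes, arbitrary concentration units).
times = [0.0, 5.0, 10.0, 20.0, 40.0]
true_k, true_c0 = 0.05, 100.0
noise = [1.05, 0.93, 1.08, 0.96, 1.02]   # fixed multiplicative errors (assumed)
measured = [true_c0 * math.exp(-true_k * t) * e for t, e in zip(times, noise)]

k_est, c0_est = fit_first_order(times, measured)
print(f"estimated k = {k_est:.4f} per minute, C0 = {c0_est:.1f}")
```

The log-linear fit is used here only because it needs no external library; it weights errors differently from a direct nonlinear fit, which may matter when low concentrations are measured less accurately.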


10.6.3 High Throughput Detection of Regulatory Interactions 


By now the observant reader must have realized that novel methods in biological 
data acquisition more often involve the ingenious combination of existing technolo- 
gies than the introduction of truly novel measurement principles. The ChIP-chip 
technology is an elegant combination of chromatin immunoprecipitation and gene 
expression microarrays (hence the name) (Ren et al., 2000). In the first step, a 
tagged transcription factor is used to isolate the upstream DNA regulatory regions 
it is binding to. Then a DNA microarray containing probes for a large number of 
upstream gene regulatory regions, in the case of yeast for the entire genome (Ren 
et al., 2000), is used to determine which of the regions have been enriched during the 
first step. This will then constitute a microarray-based approach that will determine 
which regions a given transcription factor is binding to under a given experimental 
condition. This will effectively map out a network of putative regulatory interac- 
tions between gene expression regulators and regulated genes (Lee et al., 2002). 
Obviously, the experimental noise of the two technologies will be compounded, re- 
quiring various computational methods to produce reliable measurements by using 
independent supporting evidence from other data sources, such as the coregulatory 
patterns of genes (Bar-Joseph et al., 2003). 





10.7 Population Averaged versus Single Cell Measurements 


Although single molecule measurements may just be around the corner (Chan 
et al., 2004), the detection limits of currently applied measurement techniques in 
systems biology, especially the massively parallel technologies, require the presence 
of millions of a given molecular species for reliable quantification. Such a high
number of molecules requires material from at least tens of thousands of cells.
Consequently, most of the above-described methods will yield population averaged 
values. The potential risk of using population averaged data is well exemplified by 
an interesting study by the group of Ferrell (Bagowski and Ferrell Jr., 2001). In 
this study they showed that in individual Xenopus oocytes the activation of the 
kinase JNK shows a steep dose response curve with a Hill coefficient of around 
100. In population averaged measurements using hundreds of oocytes, the way
these experiments are usually performed, the same dose response curve shows
no cooperativity with an apparent Hill coefficient of 1. This means that the 
ultrasensitive bistable switch of JNK activation can be detected only in single cell 
measurements, and a very different kinetic model would be built based on the 
population averaged data. In such cases, single cell measurements produce a more 
accurate description. 
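
The masking effect of averaging can be illustrated with a minimal simulation sketch. It assumes, purely for illustration, that every cell responds with a steep Hill function and that the half-maximal dose (EC50) differs from cell to cell. The single-cell Hill coefficient of 50, the log-uniform spread of EC50 values, and all function names are hypothetical choices and are not taken from the cited study.

```python
import math
import random


def hill(dose, ec50, n):
    """Hill-type dose response between 0 and 1."""
    return dose**n / (ec50**n + dose**n)


def population_response(dose, ec50_values, n):
    """Mean response of a population whose cells differ in EC50."""
    return sum(hill(dose, e, n) for e in ec50_values) / len(ec50_values)


def apparent_hill(response):
    """Apparent Hill coefficient from the doses giving 10% and 90% response.

    For an exact Hill curve EC90/EC10 = 81**(1/n), hence n = ln(81)/ln(EC90/EC10).
    """
    def dose_for(level):
        lo, hi = 1e-3, 1e3          # search window, assumed to bracket the curve
        for _ in range(100):        # bisection on a logarithmic dose axis
            mid = math.sqrt(lo * hi)
            if response(mid) < level:
                lo = mid
            else:
                hi = mid
        return math.sqrt(lo * hi)
    return math.log(81.0) / math.log(dose_for(0.9) / dose_for(0.1))


random.seed(0)
# Assumed heterogeneity: EC50 spread log-uniformly over two decades.
ec50_values = [10 ** random.uniform(-1.0, 1.0) for _ in range(500)]

n_single = apparent_hill(lambda d: hill(d, 1.0, 50))
n_population = apparent_hill(lambda d: population_response(d, ec50_values, 50))
print(f"single cell: {n_single:.1f}, population average: {n_population:.1f}")
```

Under these assumptions the averaged curve is far shallower than any single-cell curve, because the population response mainly reflects the distribution of thresholds rather than the switch-like behavior of each cell.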

The experiments described above involve the isolation of material for Western blot 
analysis from individual oocytes and thus require a significant amount of meticulous 
bench-work to obtain information about each individual cell. Fortunately, fluores- 
cent proteins (see section 10.6) provide a convenient experimental tool for relatively 
high throughput single cell measurements. Sometimes the genomic sequence of a 
given fluorescent protein is simply inserted behind a transcriptional promoter of 
interest. In these cases the expression level of the reporter protein serves as a surro- 
gate marker for promoter activity and can be quantified in hundreds or thousands 
of individual cells by direct measurement of fluorescent intensity (Elowitz et al., 
2002; Raser and O’Shea, 2004). This relatively simple experimental arrangement 
has already produced interesting results by directly demonstrating the stochasticity 
of gene expression in individual cells (Elowitz et al., 2002) and also by quantifying 
the noise of transcriptional and translational activity (Blake et al., 2003; Raser and 
O’Shea, 2004). A further level of experimental complexity can be achieved by fusing 
fluorescent proteins to other proteins of biological interest. For example, in a rather 
intriguing study the two members of the p53-Mdm2 feedback loop were labeled 
individually by fluorescent proteins of different colors (Lahav et al., 2004). Surpris- 
ingly, the study showed that in human cancer cells p53 was expressed in discrete 
pulses after DNA damage, with the number of pulses differing between individual 
cells. This is a potentially relevant and rather unexpected observation regarding 
the function of one of the most studied tumor suppressor genes, p53, which could 
not have been detected by population averaged measurements. 

The number of various fluorescent proteins and the procedure required to fuse 
them to other proteins of interest limits the number of proteins that can be stud- 
ied simultaneously in a single cell. A recent approach combining the application of 
fluorescently labeled antibodies with multiparameter flow cytometry (Irish et al., 
2004) may increase the number of quantified parameters of single cell measurements 
by one to two orders of magnitude. A causal protein-signaling network was
reconstructed from such measurements using Bayesian network inference (see chapter 11)
on eleven key signaling proteins in T lymphocytes (Sachs et al., 2005). Again, the 
ability to measure changes in the signaling proteins in single cells as opposed to 
population averaged measurements was essential to the success of this approach. 





10.8 Conclusions: A Final Look at Experimental Design 


Researchers applying ordinary differential equations based models to intracellular
networks are well aware of the importance of accurate parameter estimations.
Therefore, these studies usually apply high accuracy, low throughput measurements
for parameter optimization (Hoffmann et al., 2002; Schoeberl et al., 2002). However,
not every chemical species can be measured accurately; therefore, these studies rely
on measurements of only a subset of the variables represented in the model. Conse- 
quently, a set of interesting theoretical questions arises with practical implications. 
How many variables should be quantified for reliable parameter optimization of a 
network with a given size? By what strategy should this subset of parameters be 
selected? How does the inaccuracy of measurements propagate back to parame- 
ter estimation in a robust dynamic network (see chapter 11)? Problems like these 
demonstrate well the intricate relationship between theory, modeling and
experimental biology.


Methods to Identify Cellular Architecture
and Dynamics from Experimental Data 


Rudiyanto Gunawan, Kapil G. Gadkar, and Francis J. Doyle III


A system-level understanding of the functioning behavior of a cell requires an accu- 
rate representation of the underlying complex networks of gene and protein inter- 
actions. Advances in molecular biology have provided a glimpse of such complexity 
through diverse measurements of cellular activities. In systems biology, the goal 
of network inference or reverse engineering problems is to reconstruct the complex 
network of regulatory interactions from available measurements using a mathemat- 
ical framework. Here, the reverse engineering effort faces two daunting problems: 
network size and complexity, and incomplete and inaccurate measurements. In ad- 
dition, complete knowledge of a cellular network entails the identification of not 
only the network architecture (topology) but also its dynamics. Indeed, implicit 
in the term regulation is the importance of dynamics of these interactions. Net- 
work inference from experiments has been extensively investigated in the field of 
engineering, which is known as system identification. In addition, many concepts 
in engineering, such as robustness (see chapter 2), modularity (see chapter 3), and 
optimality, have been observed in many biological systems. For these reasons, en- 
gineering approaches have been instrumental in the reverse engineering effort. This 
chapter highlights the methodologies and challenges in the reverse engineering of 
cellular networks, in particular the identification of network dynamics using engi- 
neering approaches. 





11.1 Introduction 


At the turn of the century, scientists successfully sequenced the human (Interna- 
tional Human Genome Sequencing Consortium, 2001) and other genomes, enabled 
by advances in high throughput measurements in molecular cell biology. The
complete human genome provides the blueprint of human cells and creates opportunities

Figure 11.1 A hypothetical biological network topology. Each node in the graph can
represent a biological entity such as genes, transcripts, or proteins. The edges show the 
interactions, such as activation/inhibition. In contrast, the dynamics describes the nature 
of each interconnection, for example, a Hill-type kinetics. The topology and dynamics 
completely characterize a network behavior. 


to advance our understanding of cellular functions. The success of genomics ush- 
ers in a new era that is characterized by a shift from a reductionist approach of 
molecular cell biology research in the past, to a systemic or integrated approach: 
systems biology. The emphasis in a systemic approach is to ascertain the complex 
interactions in the network of genes and proteins that produce the observed cellular 
phenotypes under different conditions and/or stimuli. Here, the function of a gene 
or protein is described in the context of its dynamical interactions with other ele- 
ments in the network. This chapter introduces the methods and challenges in one 
aspect of systems biology, namely the identification of the cellular networks from 
experiments. This area of research is also known as reverse engineering or network 
inference. 

Biological networks can be categorized according to the cellular functions that 
they describe, such as protein-protein networks, transcriptional networks, metabolic
networks, and signal transduction pathways. A typical biological network has two
primary facets: its topology and its dynamics (kinetics). The former
describes the interconnections among the parts of the network (genes, transcripts, 
proteins), while the latter gives the nature of these interactions. The dynamics can 
be as simple as a linear function, or a nonlinear function such as Michaelis-Menten 
kinetics. A schematic of a hypothetical biological network is shown in figure 11.1. 
The goals of reverse engineering such complex networks are also multi-faceted, 
including: (i) hypothesis generation, (ii) design of experiment, (iii) understanding 
of cellular function, and (iv) unraveling design principles. 

The sources of experimental data for the reverse engineering problems include 
large scale deletion projects, high throughput DNA microarray experiments, and 
chromatin immunoprecipitation assays (ChIP-on-chip) (see chapter 10 for more 
complete discussions on the data acquisition techniques for systems biology). The 
utility of these data, however, is limited by many factors, such as high level of
noise, low sampling frequency, type of experimental protocol, and other issues
as highlighted in chapter 10. The noise inversely correlates with the amount of 
information in the data, while the low sampling frequency restricts the identification 
of dynamics in the network. Because of the wide variety of modeling objectives 
and the heterogeneous sources of data, there exists a wide spectrum of modeling
approaches in the reverse engineering of cellular networks, as described in these 
reviews (D’haeseleer et al., 2000; Ideker and Lauffenburger, 2003; Stelling, 2004; 
Barabasi and Oltvai, 2004). The highest level of the spectrum, and hence the 
most abstract, involves models that mostly describe the network topology with 
little or no dynamics, such as signed directed graphs or Bayesian networks. The 
identification of network topology benefits greatly from these models as they can 
efficiently handle highly complex networks. The lower level models incorporate 
the physicochemical details into the network topology, which greatly increases the 
difficulty of reverse engineering problems (Ronen et al., 2002). Such models typically 
consist of differential equations such as those described in chapter 6, though they 
can be as simple as Boolean networks. 

One of the simplest representations of a cellular network is a directed graph 
(similar to the network shown in figure 11.1, but the edges have directions/arrows). 
The directed edges convey the flow of influence in the network. For example, a node 
A with a directed edge to a node B implies that A directly influences the activity 
of B. Such model structure mostly captures the network topology, which can be 
effectively reconstructed from gene perturbation data (Wagner, 2001). A step down 
in the modeling spectrum is a signed directed graph (SDG), which is also a graph
of nodes with directed edges. Here, however, the edges can assume positive or negative
values based on the influences, that is, activation or inhibition, respectively. The 
network inference problem of this model structure uses comparative methods on 
gene expression of wild-type and mutants created from deletion experiments (Kyoda 
et al., 2004). 
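
The comparative idea can be sketched in a few lines. The following is only an illustration under simplifying assumptions, not the published algorithm: expression levels are given as dictionaries, a fixed threshold decides whether a change is significant, and every significant change is recorded as an edge, so direct and indirect influences are not distinguished. The function name, the data, and the threshold are hypothetical.

```python
def infer_signed_edges(wild_type, mutants, threshold):
    """Return {(deleted_gene, affected_gene): sign} from deletion profiles.

    sign = +1 (activation) if deleting the regulator lowers the target,
    sign = -1 (inhibition) if deleting the regulator raises the target.
    """
    edges = {}
    for deleted, profile in mutants.items():
        for gene, level in profile.items():
            if gene == deleted:
                continue  # the deleted gene itself carries no information
            change = level - wild_type[gene]
            if change <= -threshold:
                edges[(deleted, gene)] = +1
            elif change >= threshold:
                edges[(deleted, gene)] = -1
    return edges


# Hypothetical expression levels (arbitrary units).
wild_type = {"A": 1.0, "B": 1.0, "C": 1.0}
mutants = {
    "A": {"A": 0.0, "B": 0.2, "C": 1.0},   # deleting A lowers B
    "B": {"A": 1.0, "B": 0.0, "C": 2.5},   # deleting B raises C
}
edges = infer_signed_edges(wild_type, mutants, threshold=0.5)
print(edges)
```

In this toy data set the sketch reports that A activates B and that B inhibits C. Removing edges that are explained by indirect paths would require an additional step that is not shown here.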

Another model structure with a directed graph architecture is a Boolean network, 
which likewise consists of nodes connected by directed edges. Here, the nodes assume binary
numbers representing high or low levels (1 or 0, respectively). In a gene network, 
high level represents activated/expressed, while low indicates inactive/suppressed. 
Each directed edge corresponds to a Boolean logic function describing the influence 
of one gene on another. The inference problem of this model structure utilizes
steady state gene expression levels from perturbation experiments of gene deletion 
or overexpression (Ideker et al., 2000). 
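
A Boolean network and a deletion experiment on it can be sketched as follows. The three-gene network, its logic rules, and the synchronous update scheme are assumptions made for illustration only; they do not correspond to any network in the cited work.

```python
# Hypothetical three-gene network: each rule gives the next value of one gene.
rules = {
    "A": lambda s: not s["C"],          # C represses A
    "B": lambda s: s["A"],              # A activates B
    "C": lambda s: s["A"] and s["B"],   # A and B jointly activate C
}


def step(state, knocked_out=()):
    """Synchronous update; knocked-out genes are clamped to 0 (deletion)."""
    return {g: 0 if g in knocked_out else int(rules[g](state)) for g in rules}


def attractor(state, knocked_out=()):
    """Iterate until a state repeats and return the cycle that is reached."""
    seen = []
    while state not in seen:
        seen.append(state)
        state = step(state, knocked_out)
    return seen[seen.index(state):]


start = {"A": 1, "B": 0, "C": 0}
wild_type_cycle = attractor(start)
deletion_cycle = attractor(start, knocked_out=("C",))
print(len(wild_type_cycle), deletion_cycle)
```

With these rules the unperturbed network oscillates through five states, whereas deleting C leaves a single steady state. Steady states obtained under such perturbations are the kind of data from which the logic rules would be inferred.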

Bayesian networks use a probabilistic approach to modeling cellular networks 
with a directed acyclic graph, in the same spirit as chapter 4. Here, each node 
represents a random variable characterized by a conditional probability with respect 
to its immediate parent nodes (that is, the start nodes of incoming edges). Thus, 
the state of the network is described by the joint probability 


$$P(X_1, \ldots, X_N) = \prod_{i=1}^{N} P(X_i \mid X_J) \qquad (11.1)$$


where N is the total number of nodes in the network, and $P(X_i \mid X_J)$ is the conditional
probability of the i-th node to assume the value $X_i$ given the values of its parent
nodes $X_j$, $j \in J$, such that J is the set of indices of the parent nodes. This
steady state model structure does not provide the dynamics of the network but can 
explicitly account for the noise from experimental measurements, protocols, and the 
inherent stochastic nature of gene expression. As in the previous model structures, 
the reverse engineering using Bayesian networks utilizes data from perturbed gene 
expression profiles (Pe’er et al., 2001). 
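
The factorization of the joint probability into conditional probabilities can be made concrete with a small sketch. The three-node network, its binary states, and the probability values are hypothetical and serve only to illustrate the product over nodes.

```python
from itertools import product

# Hypothetical network A -> B, A -> C with binary node states.
parents = {"A": (), "B": ("A",), "C": ("A",)}

# cpt[node][parent_values] = P(node = 1 | parent values)
cpt = {
    "A": {(): 0.3},
    "B": {(0,): 0.1, (1,): 0.8},
    "C": {(0,): 0.5, (1,): 0.2},
}


def joint_probability(assignment):
    """Product over all nodes of P(node value | values of its parents)."""
    p = 1.0
    for node, pa in parents.items():
        p_one = cpt[node][tuple(assignment[q] for q in pa)]
        p *= p_one if assignment[node] == 1 else 1.0 - p_one
    return p


p_example = joint_probability({"A": 1, "B": 1, "C": 0})   # 0.3 * 0.8 * (1 - 0.2)
total = sum(
    joint_probability(dict(zip("ABC", values)))
    for values in product((0, 1), repeat=3)
)
print(p_example, total)
```

Summing the joint probability over all eight states returns one, which is a useful sanity check on any set of conditional probability tables.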

The more detailed model structures involve detailed dynamics of each interaction, 
such as the S-systems (Savageau, 1988). This model structure is based on mass 
action kinetics, in which the dynamics of interaction is described using nonlinear
power-law functions:





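$$\frac{dx_i}{dt} = \alpha_i \prod_{j=1}^{N} x_j^{g_{i,j}} - \beta_i \prod_{j=1}^{N} x_j^{h_{i,j}} \qquad (11.2)$$

where $x_i$ is the state variable describing the concentration of cellular molecules
(genes, proteins), $\alpha_i$ and $\beta_i$ are the rate constants, and $g_{i,j}$ and $h_{i,j}$ are the kinetic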
orders. This framework is flexible enough to capture common dynamics in cellular 
functions such as Michaelis-Menten kinetics. As expected, the inference problem of 
this model structure is computationally intensive because of the need to simulate 
highly nonlinear differential equations (Kimura et al., 2005). In addition, the reverse 
engineering of network dynamics (that is, estimating the model parameters $\alpha_i$, $\beta_i$,
$g_{i,j}$, and $h_{i,j}$) requires time-series data, whenever available.
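
To give a feel for the forward simulation that each step of such an inference requires, the sketch below integrates a two-variable S-system with a simple explicit Euler scheme. The network, the parameter values, the step size, and the function names are illustrative assumptions; a production implementation would normally use an adaptive ODE solver.

```python
def s_system_rhs(x, alpha, beta, g, h):
    """Right-hand side of an S-system: production minus degradation power laws."""
    dx = []
    for i in range(len(x)):
        production = alpha[i]
        degradation = beta[i]
        for j in range(len(x)):
            production *= x[j] ** g[i][j]
            degradation *= x[j] ** h[i][j]
        dx.append(production - degradation)
    return dx


def simulate(x0, alpha, beta, g, h, dt=0.001, steps=20000):
    """Explicit Euler integration (assumed adequate for this non-stiff toy case)."""
    x = list(x0)
    for _ in range(steps):
        dx = s_system_rhs(x, alpha, beta, g, h)
        x = [xi + dt * dxi for xi, dxi in zip(x, dx)]
    return x


# Hypothetical system: x2 inhibits the production of x1, x1 drives x2.
alpha = [2.0, 1.0]
beta = [1.0, 1.0]
g = [[0.0, -0.5], [1.0, 0.0]]   # kinetic orders of the production terms
h = [[1.0, 0.0], [0.0, 1.0]]    # kinetic orders of the degradation terms

x_final = simulate([1.0, 1.0], alpha, beta, g, h)
print(x_final)
```

For this choice of parameters both variables settle at the steady state $x_1 = x_2 = 2^{2/3}$, which follows from setting the two right-hand sides to zero.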

One model structure, based on Petri nets, attempts to combine the graph repre- 
sentation of the network and the detailed dynamics of differential equations. This 
hybrid functional Petri net (HFPN) architecture supports different cellular enti- 
ties using various primitive data types (Boolean, string, real), types of interactions 
(discrete/stochastic, continuous, generic), and prior knowledge of the system (Mat- 
suno et al., 2000; Nagasaki et al., 2004). Here, the nodes are connected to each 
other by connectors (arcs), and the dynamics are described by mappings associated 
with each connector. Because of its flexibility, the reverse engineering of this model 
structure can potentially accommodate any type of data, including gene expression 
and biological facts. 

The complete reverse engineering of a cellular network needs to identify both 
the topology and dynamics of interactions. The challenges in this problem are 
multiple, starting from the selection of model structures to the identification of 
model parameters from noisy measurements. In particular, the inference of network 
dynamics is difficult due to the data quantity and quality and the parameter 
identifiability issues, which will be discussed in greater detail in section 11.3. The 
underlying reason for the difficulty is the mismatch between the available and 
the required data to uniquely identify a model structure. Indeed, the selection of 
model structure determines the types and amount of data necessary for a complete 
reconstruction of the network (Selinger et al., 2003). For example, to identify p
parameters in a set of nonlinear differential equations, one theoretically
needs 2p + 1 randomly chosen experiments (assuming zero measurement
noise) (Sontag, 2002). When only the network topology is desired, the number
needed reduces to r + 1 experiments, where r is the total number of possible
connections.

The purpose of this chapter is to provide a conceptual overview of the issues 
involved in the reverse engineering of cellular networks with emphasis on the 
dynamical characterization. This chapter complements and extends a previous 
review that emphasized the inference of network topology (D’haeseleer et al., 
2000). The next section gives a motivating example, which highlights some of 
the difficulties in a cellular network inference problem. Section 11.3 discusses the 
different issues and methodologies in reverse engineering with respect to both the 
topology and dynamics. Tutorials are presented in the form of case studies involving 
a metabolic network in E. coli and a signal transduction pathway in caspase-activated
apoptosis. Finally, open research problems in this area are identified based
on the analysis of the case studies. 





11.2 A Motivating Example 


High throughput gene expression profiles can provide system-wide level measure- 
ments for reverse engineering of genetic regulatory networks in the cells. The efficacy 
of these measurements for inferring the network information, such as the kinetic 
parameters, depends on the complexity of the underlying gene network as well as 
the quality and quantity of measurement data. These issues were addressed using 
a formal identifiability analysis by Zak et al. (2003), whose results will be summa- 
rized here. In particular, the analysis considers two types of identifiability: a priori 
identifiability and practical identifiability, as a function of the input perturbations 
and the fluctuations in the gene expression due to the inherent stochastic nature of 
the process. A priori identifiability is concerned with the ability to uniquely identify 
model parameters from noise-free experimental data, given a particular model and 
a particular input-output experiment. On the other hand, practical identifiability is 
concerned with the accuracy of parameter values that can be estimated from noisy 
measurements. 

In the aforementioned study, an in silico genetic regulatory network was con- 
structed from an arrangement of common regulatory motifs: cascade, mutual re- 
pression, auto-activation and sequestration, and agonist-induced receptor down- 
regulation (Zak et al., 2003). The model consists of 44 species with a total of 
97 parameters involved in 118 reactions, including promoter binding/unbinding, 
transcription, transcript degradation, translation, protein monomer degradation, 
protein dimerization/undimerization, and dimer degradation processes. The net- 
work exhibits multiple steady state behavior depending on the presence of a ligand 
input (see figure 11.2). The identifiability analyses were performed on the in silico
network as a function of the ligand perturbations: a step, a 1-hour pulse, and

Figure 11.2 In silico genetic regulatory network. Dashed arrows represent chemical
reactions (not regulation). The dotted nodes (F, EQ, Q) exist only when the ligand Q is 
present in the system. In the presence of the ligand Q, the genes B, D, F, G, J, and K are 
fully expressed (HIGH state), while genes A, C, E, and H are suppressed (LOW state). 
On the other hand, the absence of the ligand drives the genes A, C, E, and H to HIGH 
and the genes B, D, G, J, and K to LOW states. 


two 1-hour pulses 1 day apart. In addition, as single-cell gene expression profiling 
is realizable (Hemby et al., 2002), the number of cells collected in each sampling 
time is also treated as an experimental variable. Two approaches, deterministic and 
stochastic, were used in the simulations of the in silico network with the parameter 
estimates from published values for genes and proteins with similar roles to those 
in the network. 


11.2.1 Methodologies 


A priori identifiability analysis utilizes the correlation matrix of the parameters,
$M_c$ (Beck and Arnold, 1977):


$$M_c(i, j) = \frac{V_p(i, j)}{\sqrt{V_p(i, i)\, V_p(j, j)}} \qquad (11.3)$$


where $V_p(i, j)$ is the (i, j)-th element of the parameter covariance matrix. The
covariance matrix quantifies the degree of (co)variability in random variables (such 
as noise in the measurements, parameter inaccuracies), which is given by the 
expected value: 


$$V_w = E\left[(w - \bar{w})(w - \bar{w})^T\right] \qquad (11.4)$$


where w denotes the vector of random variables, $\bar{w}$ denotes the mean values
of w, and $E[\cdot]$ represents the expected value operator. The (i, j)-th element of
the symmetric $M_c$ conveys the degree of correlation between the i-th and j-th
parameters, where a value of 1 (-1) implies a perfect (opposite) correlation (the
diagonal elements of $M_c$ are exactly 1 since a parameter is perfectly correlated
with itself). For example, consider the following simple system:


$$y = (p_1 + p_2)\, u \qquad (11.5)$$


where y is the output and u is the input. From the measurements of y (given
u), one can only identify the sum of parameters ($p_1 + p_2$), but not $p_1$ and $p_2$
independently. Here, the two parameters are said to have a perfect correlation (in
this case, a correlation coefficient of 1). Thus the parameters that have correlations
between -1 and 1 (that is, $-1 < M_c(i, j) < 1$) can be independently identified
from experimental data (assuming perfect measurements). Thus a parameter that 
has a perfect correlation only to itself is said to be a priori identifiable. Further, 
the parameters that do not satisfy this condition can be reduced to a smaller set 
of identifiable parameters through an iterative parameter reduction process (Zak 
et al., 2003). 
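
The conversion of a covariance matrix into a correlation matrix, and the screening for perfectly correlated parameter pairs, can be sketched as follows. The covariance values, the numerical tolerance, and the function names are hypothetical; the iterative reduction procedure of the cited study is not reproduced here.

```python
import math


def correlation_matrix(cov):
    """Normalize a covariance matrix to a correlation matrix (cf. equation 11.3)."""
    n = len(cov)
    return [
        [cov[i][j] / math.sqrt(cov[i][i] * cov[j][j]) for j in range(n)]
        for i in range(n)
    ]


def perfectly_correlated_pairs(corr, tol=1e-6):
    """Off-diagonal pairs with |correlation| = 1, which cannot be identified separately."""
    n = len(corr)
    return [
        (i, j)
        for i in range(n)
        for j in range(i + 1, n)
        if abs(abs(corr[i][j]) - 1.0) < tol
    ]


# Hypothetical parameter covariance: parameters 0 and 1 vary in lockstep.
cov = [
    [4.0, 2.0, 0.5],
    [2.0, 1.0, 0.25],
    [0.5, 0.25, 1.0],
]
corr = correlation_matrix(cov)
pairs = perfectly_correlated_pairs(corr)
print(pairs)
```

Here only the third parameter would be regarded as a priori identifiable on its own; the first two would have to be lumped or one of them fixed by an independent measurement.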

Practical identifiability analysis uses the Fisher information matrix (FIM) as a 
measure of the informativeness of noisy measurement data for estimating the model 
parameters. The inverse of FIM provides the lower bound for the variances of the 
parameter estimates (or the upper bound for accuracy) based on the Cramer-Rao 
inequality (Ljung, 1999a). If the noise in the data follows the Gaussian distribution, 
the FIM reduces to 


$$\mathrm{FIM} = \sum_{i=1}^{N} S_x^T(t_i)\, V_m^{-1}\, S_x(t_i) \qquad (11.6)$$

where $S_x$ is the sensitivity matrix with respect to the a priori identifiable parameters
(Varma and Palsson, 1994) and $V_m$ is the covariance of the measurements (a
measure of noise in the data). The lower bounds of the parameter variances are
given by


$$\sigma_{p_i}^2 \geq \left[\mathrm{FIM}^{-1}\right]_{ii} \qquad (11.7)$$

from which the 95% confidence interval for each parameter $p_i$ can be defined as

$$\left[p_i - 1.96\,\sigma_{p_i},\; p_i + 1.96\,\sigma_{p_i}\right] \qquad (11.8)$$


In this example, a parameter is called practically identifiable when its 95% confidence
interval excludes zero. The confidence level used for practical
identifiability can be varied to include or exclude more parameters. 
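
The chain from sensitivities to confidence intervals can be illustrated on a deliberately small model. The sketch assumes a scalar output $y(t) = p_1 e^{-p_2 t}$ with analytically known sensitivities and independent Gaussian noise of constant variance; the model, the sampling times, the noise levels, and the function names are all hypothetical.

```python
import math


def fisher_information(times, p1, p2, noise_var):
    """FIM for y(t) = p1*exp(-p2*t) with scalar Gaussian measurement noise."""
    fim = [[0.0, 0.0], [0.0, 0.0]]
    for t in times:
        # Sensitivities dy/dp1 and dy/dp2 at time t.
        s = [math.exp(-p2 * t), -p1 * t * math.exp(-p2 * t)]
        for a in range(2):
            for b in range(2):
                fim[a][b] += s[a] * s[b] / noise_var
    return fim


def invert_2x2(m):
    det = m[0][0] * m[1][1] - m[0][1] * m[1][0]
    return [[m[1][1] / det, -m[0][1] / det], [-m[1][0] / det, m[0][0] / det]]


def confidence_intervals(params, fim):
    """95% intervals from the Cramer-Rao lower bound on the variances."""
    bound = invert_2x2(fim)
    intervals = []
    for i, p in enumerate(params):
        sigma = math.sqrt(bound[i][i])
        intervals.append((p - 1.96 * sigma, p + 1.96 * sigma))
    return intervals


def practically_identifiable(interval):
    low, high = interval
    return low > 0.0 or high < 0.0   # the interval excludes zero


params = (2.0, 0.5)
dense = fisher_information([0.5 * k for k in range(11)], *params, noise_var=0.01)
sparse = fisher_information([0.0, 0.1], *params, noise_var=1.0)
ci_dense = confidence_intervals(params, dense)
ci_sparse = confidence_intervals(params, sparse)
print([practically_identifiable(c) for c in ci_dense])
print([practically_identifiable(c) for c in ci_sparse])
```

With frequent, low-noise sampling both parameters pass the test, whereas two early, noisy samples leave the decay rate $p_2$ with an interval that includes zero. The same model therefore can be practically identifiable or not depending only on the experiment.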

The Gaussian assumption may not apply for gene expression as this process 
involves very low copy numbers of chemical species, which makes it behave as a 
discrete stochastic system (see chapter 8 for the mathematical description of such
systems). With a lower bound of zero for the number of copies, for instance, the
distribution will not be symmetric and the noise in the system can become
non-Gaussian (such as log-normal or bimodal distributions). Nevertheless, the FIM can
still be evaluated using a direct analysis of the chemical master equation (Gunawan 
et al., 2005). Here, the Fisher information matrix is expressed as the variance of the 
score function (Cover and Thomas, 1991). For discrete stochastic systems, the FIM 
can be evaluated by simulating the master equation for the joint probability density 
function of the states. This simulation uses a Monte Carlo approach such as the 
stochastic simulation algorithm (Gillespie, 1976) or its approximate accelerated 
algorithm as discussed in chapter 16. The score function is equivalent to the 
normalized sensitivity of the joint distribution function with respect to the model 
parameters. 
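
The Monte Carlo building block of this procedure, the stochastic simulation algorithm, can be sketched for the simplest possible case: a single species produced at a constant rate and degraded in proportion to its copy number. The reaction scheme, the rate constants, and the function name are illustrative assumptions, and the estimation of the score function itself is not shown.

```python
import random


def gillespie_birth_death(k_prod, k_deg, n0, t_end, rng):
    """Return the copy number at time t_end for one stochastic trajectory."""
    t, n = 0.0, n0
    while True:
        a_prod = k_prod            # propensity of production
        a_deg = k_deg * n          # propensity of degradation
        a_total = a_prod + a_deg
        t += rng.expovariate(a_total)       # waiting time to the next reaction
        if t > t_end:
            return n
        if rng.random() * a_total < a_prod:  # choose which reaction fires
            n += 1
        else:
            n -= 1


rng = random.Random(1)
samples = [gillespie_birth_death(10.0, 1.0, 0, 10.0, rng) for _ in range(300)]
mean_copy_number = sum(samples) / len(samples)
print(mean_copy_number)
```

For this scheme the stationary distribution is Poisson with mean k_prod/k_deg, so the sample mean should lie close to 10. Repeating such simulations for many cells yields the empirical distribution from which distribution-based quantities are estimated.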


11.2.2 Insights from Identifiability 


A priori identifiability analysis applied to the in silico network revealed that one- 
third to over half of the parameters in the network are not a priori identifiable, 
where the step ligand input performed the worst among the three perturbations. 
A major fraction of these parameters belonged to promoter binding/unbinding and 
transcription factor dimerization/undimerization, of which many exhibited perfect 
correlations. This result suggested that some of these parameters can be combined 
by equilibrium assumption for these processes by setting the forward and reverse 
rates equal. After removal of the parameters that can be combined or measured 
directly from experiments such as mRNA degradation rates (Wang et al., 2002), 
the step ligand input only allowed 3/4 of the parameters to be a priori identifiable, 
while the pulse experiments allowed 8/9 of all parameters. 

Further analysis of the model using the FIM showed that only about half of 
the parameters were practically identifiable, of which the double pulse experiment 
performed the best among the three. In addition, the number of cells sampled 
in each experiment, which was captured using discrete stochastic simulations of 
the network, affected the fraction of identifiable parameters in a nonlinear man- 
ner. Increasing cell count in each sample reduced the noise in the measurements 
and improved the practical identifiability in a diminishing return trend. The term 
noise here relates to the inherent (discrete) stochastic nature of gene expression, 
which differs from the more common data noise arising from the measurement de- 
vices. The impact of cell sampling was especially pronounced in the identifiability 
of transcriptional interactions (differences between bound and unbound transcrip- 
tional parameters). Here, the step perturbation gave a higher fraction of identifiable 
parameters at a lower number of cell sampling while the double pulse input became 
more efficient at higher cell sampling. 
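
The diminishing return can be rationalized with a simple argument, under the simplifying assumption (not stated in the study) that the n cells pooled in one sample fluctuate independently with a common variance $\sigma^2$:

```latex
\mathrm{Var}\!\left[\frac{1}{n}\sum_{k=1}^{n} x_k\right] = \frac{\sigma^2}{n},
\qquad
\mathrm{SD}\!\left[\frac{1}{n}\sum_{k=1}^{n} x_k\right] = \frac{\sigma}{\sqrt{n}}
```

Going from 1 to 100 cells thus reduces the standard deviation of the sample mean tenfold, whereas going from 100 to 200 cells reduces it only by a further factor of about 1.4.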

This example highlights a number of difficulties in reverse engineering a cellular 
network. The first analysis showed that even with prior structural knowledge of 
the network and noise-free experimental data, a priori identifiability of the full set 
of parameters remained elusive. This finding signified the importance of obtaining 
good prior estimates of the model parameters and avoiding over-parametrization 
of the network through model reduction such as the equilibrium assumption.
In addition, the practical identifiability analysis underscored the importance of
designing the experiment protocol, such as the type of ligand perturbation used in 
this study, to produce the most informative experimental data (see section 11.4.2). 
In general, a perturbation that is rich in dynamics can more effectively excite 
the system for accurate estimation of the kinetic parameters. Also, the intrinsic 
stochastic nature of gene expression can play an important role in the practical 
identifiability of the parameters when only few cells are collected at each time 
point. This effect, however, was diminished with increasing number of collected 
cells. 





11.3 Issues for Network Inference


The problem of network inference or reverse engineering has long been an active 
research area in control theory, known as system identification (Ljung, 1999a). In 
addition to control theoretic approaches, research in other fields such as computer 
science and statistics (known as machine learning and statistical learning,
respectively) has also made significant contributions to this problem (Bock and Gough,
2003; Perrin et al., 2003). However, the reverse engineering of cellular networks 
pushes the envelope on many approaches in these fields because of the characteris- 
tics of these networks: large size and high nonlinearity. As such, the modeling efforts 
have focused on capturing both aspects: (i) network complexity, and (ii) level of de- 
tail (Stelling, 2004). Unfortunately, identification of models that embody both high 
complexity and details of a cellular network is an untenable problem, and thus, 
one major issue in reverse engineering as well as in data acquisition is to select the 
appropriate model structure that balances the network complexity and the detail 
of interactions. This selection depends on the type of network and organism, the 
available experimental data, and the intended use of the resulting model. 

Many conceptual approaches from system identification have found appropriate 
uses in the identification of cellular networks. For example, a singular value decom- 
position was used to identify all possible networks that are consistent with given 
gene expression profiles (Yeung et al., 2002). When choosing the solution among 
the candidate networks, this approach also assumed that the biological networks 
are sparse. Using a similar assumption, Gardner and colleagues (Gardner et al., 
2003) proposed a network inference algorithm based on linear regression of gene 
expression profiles. Here, each gene is assumed to have only k connectivities, where 
k is (much) fewer than the total number of genes in the network. The solution is 
then chosen to minimize the mismatch between the model prediction and experi- 
mental data. Further, an iterative approach was proposed by Tegner et al. (2003) 
to identify the network connections from gene perturbation data. At each iteration, 
the algorithm ranks the genes based on the variance of predicted connectivities 
from all consistent solutions. The gene that has the highest variance, that is, the 
most uncertainty, will be selected for the next perturbation experiment. Again, the 
network was assumed sparsely connected as in the other approaches. 
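
The sparsity assumption can be illustrated with a small regression sketch in the spirit of these methods. It is not the published algorithm: it simply tries every set of k candidate regulators for one target gene and keeps the set with the smallest least-squares error. The data, the function names, and the exhaustive search are assumptions made for illustration, and the search is only feasible for small k.

```python
from itertools import combinations


def solve(a, b):
    """Solve a small linear system a x = b by Gauss-Jordan elimination."""
    n = len(b)
    m = [row[:] + [bi] for row, bi in zip(a, b)]
    for c in range(n):
        pivot = max(range(c, n), key=lambda r: abs(m[r][c]))
        m[c], m[pivot] = m[pivot], m[c]
        for r in range(n):
            if r != c:
                f = m[r][c] / m[c][c]
                m[r] = [vr - f * vc for vr, vc in zip(m[r], m[c])]
    return [m[i][n] / m[i][i] for i in range(n)]


def best_k_regulators(X, y, k):
    """X: experiments x genes; y: target response. Returns (error, genes, weights)."""
    best = None
    for subset in combinations(range(len(X[0])), k):
        cols = [[row[j] for j in subset] for row in X]
        # Normal equations of the least-squares problem restricted to the subset.
        ata = [[sum(r[p] * r[q] for r in cols) for q in range(k)] for p in range(k)]
        aty = [sum(r[p] * yi for r, yi in zip(cols, y)) for p in range(k)]
        w = solve(ata, aty)
        err = sum((yi - sum(wi * ri for wi, ri in zip(w, r))) ** 2
                  for r, yi in zip(cols, y))
        if best is None or err < best[0]:
            best = (err, subset, w)
    return best


# Hypothetical data: the target responds to gene 0 (weight 2) and gene 2 (weight -1).
X = [[1, 0, 0, 0], [0, 1, 0, 0], [0, 0, 1, 0], [0, 0, 0, 1], [1, 1, 1, 1]]
y = [2 * row[0] - row[2] for row in X]
error, regulators, weights = best_k_regulators(X, y, k=2)
print(regulators, weights)
```

Restricting each gene to k inputs turns an underdetermined problem into many small, well-posed regressions, which is the practical benefit of assuming a sparse network.
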
A significant challenge of constructing the cellular network from experiments,
especially for a gene network model, is the large number of nodes, on the order 
of 10,000, that renders the inference problem practically intractable (for example,
determination of $10^8$ parameters of interactions). Fortunately, cellular networks are
tremendously sparse and highly structured (Wagner, 2002), such that the actual 
interactions to be identified are orders of magnitude fewer. In addition, these 
networks are not randomly connected, but highly modular and structured with 
regular hierarchies, motivating the use of a structured approach to the identification 
of such networks (Zak et al., 2005). One hierarchical decomposition of the network is 
to call the top level a network which is comprised of regulatory motifs of 2-4 genes 
(Lee et al., 2002; Shen-Orr et al., 2002; Zak et al., 2003). By searching through 
biological networks for common motifs, one can find the frequencies with which 
each simple motif occurs in the network. The much higher occurrences of these 
motifs in cells than in randomized networks (Shen-Orr et al., 2002) give support to 
a postulation that these are the basic building blocks of cellular networks. Many 
of these motifs have direct analogs in system engineering architectures, such as the 
three dominant motifs in E. coli: (i) feedforward loop, (ii) single input module, and 
(iii) densely overlapping regulon (Shen-Orr et al., 2002). At the lowest level of the 
hierarchy is the module that represents transcriptional regulation, of which a nice 
example is given by Barkai and Leibler (2000). The existence of structures in the 
complex cellular network creates an opportunity for reverse engineering methods to 
incorporate this knowledge by constraining the search methods or exploiting prior 
knowledge in Bayesian frameworks. 
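
As a rough illustration of the motif search described above, the following
Python sketch counts feedforward loops (X regulates Y, Y regulates Z, and X
also regulates Z directly) in a small directed graph. The edge list is a
made-up toy network, and the brute-force enumeration is only a sketch for small
graphs; it is not the algorithm used by Shen-Orr et al. (2002).

```python
from itertools import permutations

def count_feedforward_loops(edges):
    """Count ordered triples (x, y, z) with edges x->y, y->z and x->z."""
    edge_set = set(edges)
    nodes = {n for e in edges for n in e}
    count = 0
    for x, y, z in permutations(nodes, 3):
        if (x, y) in edge_set and (y, z) in edge_set and (x, z) in edge_set:
            count += 1
    return count

# Hypothetical toy network: one feedforward loop (X, Y, Z) plus a tail Z -> W.
toy_network = [("X", "Y"), ("Y", "Z"), ("X", "Z"), ("Z", "W")]
n_ffl = count_feedforward_loops(toy_network)
print(n_ffl)
```

Repeating the count on randomized versions of the same network (not shown)
would give the baseline against which the motif frequency is judged.
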

The interconnections between the nodes in a cellular network are not static. In 
fact, dynamic behavior is an essential property of complex biophysical networks 
(Zaslaver et al., 2004) that must be captured in the modeling efforts. There exist 
preliminary ideas in capturing network behavior using dynamic models in both 
discrete time (Hartemink et al., 2002) and continuous time (Zak et al., 2004). The 
problems associated with the curse of dimensionality as noted above are more 
pronounced when one augments the dynamics with the network interconnections, 
especially for a full continuum model. Here, one major issue is the challenge of 
uniquely identifying the kinetic parameters from experimental data, typically gene 
expression profiling. This issue, known as parameter identifiability in control theory 
(Ljung, 1999a), deals with the informativeness of the data: the quantity and quality 
of the measurements with respect to the model parameters. The example in the 
previous section revealed that full knowledge of gene interconnections and perfect 
measurements still could not guarantee full identifiability of gene interactions. 

Figure 11.3 Fisher information matrix-based optimality criteria. The axes represent the 
model parameters where the origin describes the best parameter estimates. For simplicity, 
only two parameters are shown. In a system with three or more parameters, these ellipses 
are projections of the higher dimensional ellipses (hyperellipsoids) onto two-parameter 
axes. (a) The ellipse of information. The ellipsoidal axes are defined by the FIM eigenvalues 
and eigenvectors. The area quantifies the amount of information, while the shape indicates 
the distribution of information for each parameter. D-optimality design aims to maximize 
the area/volume of information (as indicated by the arrows), which is proportional to 
the determinant of FIM. (b) The ellipse of parameter uncertainty. The lengths of the 
ellipsoidal axes are equal to the inverse of the eigenvalues of FIM. A-optimal design aims 
to reduce the region of parameter uncertainty (shown by the arrows), which is measured 
by the sum of the parameter variances. 

Coupled to this, the noise in measurements and the inherent stochastic nature of 
gene expression make practical identification of genetic regulatory networks difficult. 
In practice, the reverse-engineering of a gene network should involve a careful design 
of experiment using prior knowledge of the system to obtain the most informative 
measurements. As described by the cycle of knowledge in chapter 1, this process 
should be iterative, in which the result from each trial is used to better design the 
next experiment. Here, a measure of informativeness of data, such as the Fisher 
information matrix, can help in formulating the optimal experiment design into a 
(nonlinear) optimization problem. The Fisher information matrix (FIM) takes into 
account the noise in the measurements and also gracefully handles the stochastic 
effect of gene expression. In addition, the FIM allows flexibility in choosing the 
appropriate criterion for optimality depending on the goal of model identification. 
Figure 11.3 illustrates the two most effective FIM-based optimality criteria, D- 
optimal and A-optimal, in designing experiments (Emery and Nenarokomov, 1998). 
D-optimal design aims to maximize the degree of informativeness in data by 
maximizing the determinant of FIM, which corresponds to the area/volume of the 
information hyperellipsoid (figure 11.3a). On the other hand, A-optimal design is 
equivalent to reducing the hyperellipsoid of uncertainty in parameter estimates 
(figure 11.3b). 
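
The two criteria can be compared on a small example. The Python sketch below
assumes a hypothetical two-parameter model y(t) = p1 exp(-p2 t) with
independent Gaussian measurement noise; the model, the parameter values, and
the two candidate sampling schedules are illustrative assumptions only. The FIM
is built from the parameter sensitivities, the D-criterion is its determinant,
and the A-criterion is the trace of its inverse (the sum of the parameter
variance bounds).

```python
import math

def fisher_information(times, p1, p2, sigma):
    """2x2 FIM (f11, f12, f22) for y(t) = p1*exp(-p2*t), noise std sigma."""
    f11 = f12 = f22 = 0.0
    for t in times:
        s1 = math.exp(-p2 * t)            # sensitivity dy/dp1
        s2 = -p1 * t * math.exp(-p2 * t)  # sensitivity dy/dp2
        f11 += s1 * s1
        f12 += s1 * s2
        f22 += s2 * s2
    w = 1.0 / sigma**2
    return (w * f11, w * f12, w * f22)

def d_criterion(fim):
    """Determinant of the FIM (to be maximized)."""
    f11, f12, f22 = fim
    return f11 * f22 - f12 * f12

def a_criterion(fim):
    """Trace of the inverse FIM (to be minimized)."""
    f11, f12, f22 = fim
    return (f11 + f22) / d_criterion(fim)

# Two hypothetical sampling schedules with three measurements each.
clustered = fisher_information([0.1, 0.2, 0.3], p1=1.0, p2=1.0, sigma=0.1)
spread = fisher_information([0.1, 1.0, 3.0], p1=1.0, p2=1.0, sigma=0.1)

d_clustered, d_spread = d_criterion(clustered), d_criterion(spread)
a_clustered, a_spread = a_criterion(clustered), a_criterion(spread)
print(d_clustered < d_spread, a_clustered > a_spread)
```

For these assumed values the schedule that spreads the samples over the decay
is preferred by both criteria; in general the two criteria need not agree.
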

Figure 11.4 A schematic diagram of an observer. In a physical system, the complete 
state information is usually not available. Further, the measurements typically represent 
functions of only part of the states. An observer uses (inexact) knowledge of the system 
dynamics f(x, u, t) and the input u to “guess” the states from the available measurements y. 

The iterative nature of this framework for model development and refinement of 
experimental protocol necessitates a termination criterion, which typically consists 
of a model validation test. The selection of tests to use still remains an open 
research problem because of the difficulty in comparing the performance of different 
algorithms. In the application domain of systems engineering, it is understood that 
for certain experimental data, it is not possible to absolutely confirm whether 
a model is valid. Typically, the converse test is implemented, that is, whether 
the given data contradict the model prediction (Poola et al., 1994). Such model 
(in)validation tests for reverse engineering problems can be formulated based on 
the difference between predicted and observed output with some statistics about 
these differences. These statistics limit the degree of model errors using, for example, 
maximum absolute value, mean value, and variance. 
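
A minimal sketch of such a test is given below. It assumes that the admissible
model error is specified by three user-chosen bounds (on the maximum absolute
residual, the mean residual, and the residual variance); the function name, the
bounds, and the data are hypothetical and only illustrate the idea that data
can contradict a model but cannot confirm it.

```python
from statistics import mean, pvariance

def is_invalidated(predicted, observed, max_abs_bound, mean_bound, var_bound):
    """Return (violated, stats): violated is True if any residual statistic
    exceeds its admissible bound, i.e. the data contradict the model."""
    residuals = [o - p for o, p in zip(observed, predicted)]
    stats = {
        "max_abs": max(abs(e) for e in residuals),
        "mean": mean(residuals),
        "variance": pvariance(residuals),
    }
    violated = (stats["max_abs"] > max_abs_bound
                or abs(stats["mean"]) > mean_bound
                or stats["variance"] > var_bound)
    return violated, stats

predicted = [1.0, 2.0, 3.0, 4.0]        # model output (made-up numbers)
consistent_data = [1.1, 1.9, 3.05, 4.0]
contradicting_data = [1.0, 2.0, 3.0, 5.0]

ok_flag, ok_stats = is_invalidated(predicted, consistent_data, 0.3, 0.1, 0.05)
bad_flag, bad_stats = is_invalidated(predicted, contradicting_data, 0.3, 0.1, 0.05)
print(ok_flag, bad_flag)
```

A result of False for the first data set means only that the model has not
been invalidated by these data, not that it has been validated.
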

Aside from the aspect of the quality of data, another practical limitation in most 
(if not all) attempts to reverse engineer cellular networks is the limited quantity 
of data, both in terms of sampling frequency and number of independent measure- 
ments. For example, although gene expression profiling can provide high throughput 
data to estimate interactions among thousands of genes, this method still does not 
depict the protein-mediated regulatory effects. As noted in chapter 10, current 
system level modeling efforts face the challenge of trading off data quantity against 
quality (low throughput, accurate measurements versus high throughput, relatively 
inaccurate measurements). In many cases, parameter estimation from limited 
measurements suffers from stringent computational requirements and degeneracy, where 
many parameter combinations give similar agreement to the observed behavior. 
Here, measurement selection procedures can help identify the combination of mea- 
surements that give the best identifiability. Also, an observer can provide estimates 
of all system states (gene, transcript, protein levels) from limited measurements. 

The concept of an observer is described in figure 11.4. The purpose of an ob- 
server is to infer the states of a system (for example, internal energy, entropy) from 
the measurements (such as temperature, pressure). For this reason, an observer 
is also known as a state estimator in control systems theory. There exist multiple 
approaches for designing an observer for biological systems, including extended 
Kalman filters (Stephanopoulos and San, 1984; Gee and Ramirez, 1996), artificial 
neural networks (Glassey et al., 1997; Simutis and Lübbert, 1997), and the state 
regulator problem (SRP) (see section 11.4.2) (Gadkar et al., 2005b). The state regulator 
problem approach builds on dynamic flux balance analysis (dFBA) described in this 
chapter to estimate the unknown variables in a biological system from dynamical 
measurements. The dFBA extends traditional flux balance analysis (Varma and 
Palsson, 1994) to allow the estimation of dynamic fluxes in a given metabolic net- 
work. The SRP observer is formulated as a constrained optimization problem where 
the gene network is assumed to operate optimally by minimizing unnecessary accu- 
mulation of intermediates (states) and fluxes (reactions) in the framework of dFBA. 
Given the full estimates of the network states and fluxes, the parameter estimation 
becomes decoupled and thus computationally efficient with lower probability for 
degeneracy. 


11.4 Case Studies 


The case studies derive from the applications of engineering approaches to the 
reverse engineering of cellular networks, in particular to identify the dynamics 
of biological networks. In the first example, an optimization-based approach is 
demonstrated for estimating the dynamic behavior of a metabolic network in E. 
coli from experiments (Mahadevan et al., 2002). The second example introduces a 
framework for iterative network inference to identify a signal transduction pathway 
in caspase-activated apoptosis (Fussenegger et al., 2000). 


11.4.1 Dynamic Flux Balance Analysis 


As recent developments in genomics provide information of the cellular architecture, 
the logical next step is to study the dynamic behavior of the cellular network. A 
primary bottleneck for this is the lack of kinetic information of the intracellular 
reactions. The flux balance analysis approach in chapter 5 utilizes the known 
stoichiometry to predict the flux distributions in the network without requiring 
the kinetic information (Varma and Palsson, 1994). However, the approach can 
be used to study only the steady state operations of the network, preventing its 
applicability in situations where dynamic reprogramming of the metabolic network 
is important. In this case study, a dynamic flux balance analysis (dFBA) approach 
is discussed that is capable of predicting the dynamics of the metabolic network 
with modest requirements of experimental data. 

To motivate the concept, we consider the diauxic growth of Escherichia coli. 
Using the metabolic network of E. coli, the extreme pathways are identified with 
glucose, acetate, and oxygen as input and acetate and biomass as output. From the 
extreme pathways, four primary pathways are determined, based on the biomass 
yield, to represent both aerobic and anaerobic growth on glucose and aerobic 
utilization of acetate. These pathways are expressed as a simplified network shown 
in figure 11.5. A dynamic model for the prediction of the time profiles for the batch 
bioreactor based on the simplified network is represented in the equations, 





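\[
\begin{aligned}
\frac{d\,Glc_{ext}}{dt} &= St^{Glc_{ext}}\, v\, X \\
\frac{d\,Ac}{dt} &= St^{Ac}\, v\, X \\
\frac{d\,O_2}{dt} &= St^{O_2}\, v\, X + k_L a\,(O_2^{*} - O_2) \\
\frac{dX}{dt} &= (v_1 + v_2 + v_3 + v_4)\, X
\end{aligned}
\tag{11.9}
\]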

where X represents the biomass concentration; $St^{Glc_{ext}}$, $St^{Ac}$, and $St^{O_2}$ are the rows of 
the stoichiometric matrix associated with glucose, acetate, and oxygen, respectively; 
v is the vector of reaction fluxes; $k_L a$ is the mass transfer coefficient for oxygen 
(7.5 h$^{-1}$); and $O_2^{*}$ is the oxygen concentration in the gas phase (0.21 mM). 

[Network diagram; recoverable reaction label: 9.46 $Glc_{ext}$ + 12.92 $O_2$ → X] 

Figure 11.5 The simplified metabolic network of the diauxic growth in E. coli with 
glucose, acetate, and oxygen as inputs and biomass and acetate as outputs. 

To determine the dynamic profiles of the metabolite levels, the dynamic fluxes 
need to be determined. With the absence of kinetic information that relates the 
fluxes to concentrations, dynamic optimization is proposed for determining the 
fluxes and metabolite concentrations. It is based on an assumption that the cellular 
processes are performed optimally in order to achieve a cellular objective. Similar 
assumptions are made in the FBA approach (Edwards et al., 2001a) and in the 
cybernetic modeling approach (Varner and Ramkrishna, 1999). For the dFBA 
approach considered here, maximizing the instantaneous growth rate is proposed 
as the built-in cellular objective. Other candidate objective functions which have 
shown good fit of experimental data include maximization of biomass (Burgard 
and Maranas, 2003) or minimization of total fluxes (known as the principle of flux 
minimization) (Holzhütter, 2004). 

The dFBA approach involves an optimization over the entire time period of 
interest to obtain time profiles. The optimization problem is shown below: 


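\[
\max_{z(t),\,v(t)} \; \sum_{j=0}^{M} \int_{t_0}^{t_f} \frac{\dot{X}(t)}{X(t)}\, \delta(t - t_j)\, dt \tag{11.10}
\]

such that 

\[
\begin{aligned}
\dot{z} &= F(v, z) && (11.11) \\
|\dot{v}| \le \dot{v}_{max}; \quad z \ge 0; \quad c(v, z) &\le 0 \quad \forall\, t \in [t_0, t_f] && (11.12) \\
z(t_0) &= z_0 && (11.13) \\
t_j &= t_0 + j\,\frac{t_f - t_0}{M}, \quad j = 0, \ldots, M && (11.14)
\end{aligned}
\]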

Figure 11.6 An example of a system with two fluxes and two constraints A and B 
(aside from the positivity of fluxes). Each inequality constraint is shown as a shaded 
region. The combination of the two constraints A + B limits the feasible space of flux pairs 
for the dFBA optimization. 

The time period of interest is divided into a finite number of intervals (equation 
11.14). The optimization maximizes the growth rate at each of these intervals. 
The objective function is scaled such that all points are equally weighted. Equation 
11.11 is the matrix representation of equation 11.9 and represents the mass balance 
and continuity constraints. The rate of change of flux constraint, the non-negative 
metabolite level constraint, and the additional nonlinear constraints are imposed 
by equation 11.12. As discussed in chapter 5, these constraints reduce the feasible 
search space for the fluxes that maximize the objective function, as illustrated in 
figure 11.6. Equation 11.13 represents the initial conditions of all species. 
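
The way constraints shrink the feasible flux space can be illustrated with a
two-flux toy problem in the spirit of figure 11.6. The Python sketch below
assumes two hypothetical linear constraints A and B plus non-negativity of both
fluxes, and maximizes v1 + v2 by enumerating the vertices of the feasible
region. The numbers are made up, and vertex enumeration is used only because
the problem is two-dimensional and bounded; it is not the collocation-based
solver used for the dFBA problem.

```python
from itertools import combinations

def solve_2d_lp(constraints, objective, tol=1e-9):
    """Maximize objective[0]*v1 + objective[1]*v2 over a bounded 2-D polygon.

    Each constraint is a tuple (a1, a2, b) meaning a1*v1 + a2*v2 <= b.
    Returns (best value, v1, v2), or None if no feasible vertex exists.
    """
    best = None
    for (a1, a2, b), (c1, c2, d) in combinations(constraints, 2):
        det = a1 * c2 - a2 * c1
        if abs(det) < tol:
            continue  # parallel boundaries do not define a vertex
        # Intersection of the two boundary lines (Cramer's rule).
        v1 = (b * c2 - a2 * d) / det
        v2 = (a1 * d - b * c1) / det
        if all(p * v1 + q * v2 <= r + tol for p, q, r in constraints):
            value = objective[0] * v1 + objective[1] * v2
            if best is None or value > best[0]:
                best = (value, v1, v2)
    return best

constraints = [
    (1.0, 2.0, 8.0),    # constraint A (hypothetical): v1 + 2*v2 <= 8
    (3.0, 1.0, 9.0),    # constraint B (hypothetical): 3*v1 + v2 <= 9
    (-1.0, 0.0, 0.0),   # v1 >= 0
    (0.0, -1.0, 0.0),   # v2 >= 0
]
value, v1_opt, v2_opt = solve_2d_lp(constraints, objective=(1.0, 1.0))
print(value, v1_opt, v2_opt)
```

The optimum lies at the vertex where constraints A and B intersect, which is
the situation sketched in figure 11.6.
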

In most cases, limited fermentation data are available. Substrate and oxygen 
uptake rates and product formation rates are usually calculated. These limited 
experimental data are used as additional constraints to the dynamic optimization 
problem. For this case study, the glucose and oxygen uptake rates are 
bounded by the additional constraints shown below: 

\[ St^{Glc_{ext}}\, v \le \frac{v_{max}^{Glc_{ext}}\, Glc_{ext}}{K_m + Glc_{ext}} \tag{11.15} \]
\[ St^{O_2}\, v \le v_{max}^{O_2} \tag{11.16} \]



The glucose uptake is bounded by Michaelis-Menten kinetics involving the 
glucose concentration, and the oxygen uptake is bounded by a maximum possible 
flux. The unknown constants in the above equations are determined from the 
available experimental data ($v_{max}^{Glc_{ext}}$ = 10 mmol/gdw-h (Varma and Palsson, 1994); 
$K_m$ = 0.015 mM (Wong et al., 1997); $v_{max}^{O_2}$ = 15 mmol/gdw-h (Varma and Palsson, 
1994)). 
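
With the constants quoted above, the glucose bound of equation 11.15 can be
evaluated directly. The short Python sketch below does this for a few
illustrative glucose concentrations (the concentrations and the function name
are assumptions); it shows that with such a small K_m the bound stays close to
its maximum until glucose is nearly exhausted.

```python
def glucose_uptake_bound(glc_ext, v_max=10.0, k_m=0.015):
    """Upper bound on glucose uptake (mmol/gdw-h) from the Michaelis-Menten
    form of equation 11.15; glc_ext and k_m are in mM."""
    return v_max * glc_ext / (k_m + glc_ext)

bound_high = glucose_uptake_bound(0.4)    # glucose well above K_m
bound_half = glucose_uptake_bound(0.015)  # glucose equal to K_m
bound_zero = glucose_uptake_bound(0.0)    # glucose exhausted
print(round(bound_high, 2), bound_half, bound_zero)
```
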

The dynamic optimization is solved by parameterizing the dynamic equations 
through the use of orthogonal collocation on finite elements (Cuthrell and Biegler, 
1987). Details of solving the dynamic optimization problem are discussed by 
Mahadevan et al. (2002). Figure 11.7 shows the profiles of the metabolite levels, 
suggesting that the dFBA approach accurately predicted the dynamics of the 
diauxic growth on glucose and acetate. The dFBA also correctly predicted the 
reutilization of the acetate, which was not possible with the classical FBA approach 
(Varma and Palsson, 1994). 

Figure 11.7 Model predictions using the dFBA. Glucose, acetate, and biomass 
concentrations from the model predictions (solid lines) are compared to experimental data 
(squares) (Varma and Palsson, 1994). 

The dFBA approach does not require kinetic information of the intracellular 
reactions; it could, however, incorporate available kinetic information into the 
constraints of the dynamic optimization. Further, it allows the dynamic formulation 
of the objective function, describing characteristics such as reduction of transition 
time between two steady states or end-point optimization, within a rigorous 
mathematical framework. The primary drawback of the approach is that it typically 
requires solving a nonlinear optimization problem. As the size of the network 
increases, the computational burden could become prohibitive. However, the use of a 
simplified form of the important pathways, as done here, assists in capturing the 
dynamics of the crucial components of the network. In summary, the dFBA approach 
provides a useful tool for the quantitative study of the dynamic reprogramming of 
metabolic networks to obtain a better understanding of the behavior of the network. 


11.4.2 Iterative Model Identification 


As mentioned in the previous section, the reverse engineering of a cellular network 
should involve an iterative process. One possible framework for this process is 
depicted in figure 11.8. The model identification step is decoupled into two parts. 
The first part uses the limited measurements to give estimates of time profiles for all 
concentrations and reaction rates. These full estimates of system variables allow for 
an efficient parameter estimation in the second part. When a model (in)validation 
step necessitates further model refinement, an optimal experiment design and/or 
an optimal measurement set is determined to guide the next experiment. 

Figure 11.8 Iterative scheme for model identification. 

The application of the framework is demonstrated for the model identification 
of caspase function in cell apoptosis. The schematic of this system is shown in 
figure 11.9, which was developed by Varner and co-workers (Fussenegger et al., 
2000). This model with the published parameters is assumed as the “real” system. 
The network topology and the mechanism of the interactions are assumed to be 
known. The model can be represented in a very general form: 


\[ \dot{x} = A x + B r + C \tag{11.17} \]
\[ r = f(x, p) \tag{11.18} \]


where the vectors x and r represent the states and reaction rates, respectively. 
The matrices A and C describe the degradation and auto-generation respectively, 
whereas the matrix B represents the stoichiometry of the network. The nonlinear 
function f(x, p) represents the reaction rate equations. Further details of this model 
representation are included in Gadkar et al. (2005b). A discrete version for the 
continuous time invariant affine system is derived using a standard technique known 
as the zero-order hold (Brogan, 1991). The discrete model equation is represented 
as: 


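\[ x(k+1) = \tilde{A}\, x(k) + \tilde{B}\, r(k) + \tilde{C} \tag{11.19} \]

where 

\[ \tilde{A} = e^{A \Delta T}, \quad \tilde{B} = (e^{A \Delta T} - I) A^{-1} B, \quad \tilde{C} = (e^{A \Delta T} - I) A^{-1} C, \]

and $\Delta T$ is the sampling interval. 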


The goal of this case study is to identify the kinetic parameters p in the nonlinear 
reaction rates. 
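
The zero-order hold discretization can be checked numerically on a scalar
version of the affine model. The Python sketch below assumes a single state
with made-up coefficients a, b, c and a rate r held constant over one sampling
interval; it compares the discrete update with a fine-step Euler integration of
the continuous equation. All numerical values and function names are
illustrative assumptions.

```python
import math

def zoh_discretize(a, b, c, dt):
    """Zero-order hold discretization of dx/dt = a*x + b*r + c (scalar, a != 0)."""
    a_d = math.exp(a * dt)
    b_d = (a_d - 1.0) / a * b
    c_d = (a_d - 1.0) / a * c
    return a_d, b_d, c_d

def simulate_continuous(a, b, c, x0, r, dt, steps=20000):
    """Fine-step Euler integration over one sampling interval, r held constant."""
    h = dt / steps
    x = x0
    for _ in range(steps):
        x += h * (a * x + b * r + c)
    return x

a, b, c = -0.5, 2.0, 0.1      # illustrative coefficients only
dt, x0, r = 0.2, 1.0, 0.3     # sampling interval, initial state, held rate

a_d, b_d, c_d = zoh_discretize(a, b, c, dt)
x_discrete = a_d * x0 + b_d * r + c_d
x_reference = simulate_continuous(a, b, c, x0, r, dt)
print(x_discrete, x_reference)
```

The two values agree to within the integration error of the reference
solution, since the discretization is exact when r is constant over the
interval.
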

Figure 11.9 Caspase-activated apoptosis mechanism. The model includes two triggers 
for the activation of the cell suicide mechanism, extracellular death ligand and stress-related 
factor (Fussenegger et al., 2000). Cell death occurs when executioner caspase is 
activated by caspase-8 (ligand effector) or caspase-9 (stress-related effector). 





A possible first step in the model identification framework is the measurement 
selection. Parameter identifiability is crucial prior to the parameter estimation from 
experimental data. Practical identifiability of parameters discussed in section 11.2 
is used for the selection of measurements that minimize the confidence interval 
(equation 11.8) for the model parameters. In this case study, the efficacy of model 
refinement by changing the experiment design or by improving the measurement 
set is compared. Thus, the first iteration in this case study is carried out with a 
suboptimal measurement set. The details of parameter confidence intervals for both 
optimal and suboptimal sets are given by Gadkar et al. (2005a). 

The model identification is decoupled into two parts: a state regulator problem 
(SRP) based estimator and a parameter estimation step. The SRP estimator 
uses the limited measurements to determine the time profiles of all unknown 
concentrations and reaction rates. It is based on a premise, similar to the dFBA 
approach, that cellular processes have evolved regulatory structures to optimally 
use the cellular resources. This translates into two postulates: (1) network flows 
are managed to minimize intracellular accumulation and (2) networks are managed 
to minimize the number of edges carrying flux. The estimator is formulated as a 
quadratic optimization problem as shown below: 


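\[
\min_{r(k), \ldots, r(k+h-1)} \; \sum_{j=0}^{h-1} \left[ x(k+j+1)^T W_x\, x(k+j+1) + r(k+j)^T W_r\, r(k+j) \right] \tag{11.20}
\]

subject to: 

\[
\begin{aligned}
x(k+j+1) &= \tilde{A}\, x(k+j) + \tilde{B}\, r(k+j) + \tilde{C} && \forall\, j = 0, \ldots, h-1 && (11.21) \\
x(k+j+1) &\ge 0 && \forall\, j = 0, \ldots, h-1 && (11.22) \\
|x_i(k+j+1) - x_i^{*}(k+j+1)| &\le \Delta_{i,j} && \forall\, j = 0, \ldots, h-1 && (11.23)
\end{aligned}
\]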








The objective function consists of two terms: the first represents the accumulation 
of the intracellular species, and the second describes the flux utilization. The 
terms $W_x$ and $W_r$ are the matrices of weights associated with these two terms. 
The SRP optimization is subject to constraints of mass balance (equation 11.21), 
non-negativity of concentrations (equation 11.22), and constraints imposed by the 
available measurements (equation 11.23). The term $x^{*}$ represents the measurements 
and $\Delta_{i,j}$ denotes the tolerance around the measurement describing the measurement 
error. Finally, the variable h denotes the prediction horizon of the SRP estimator. 
The optimization problem is solved for each sampling time to determine 
the profiles of all fluxes and species concentrations. 
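
The structure of the SRP estimator can be seen in its simplest form: one
state, one reaction rate, and a horizon of h = 1. In that case the quadratic
program reduces to a one-dimensional convex problem whose solution is the
unconstrained minimizer clipped to the interval allowed by the measurement band
and the non-negativity constraint. The Python sketch below implements this
special case only; the scalar setting, the function name, and all numerical
values are assumptions made for illustration, and a general problem would need
a quadratic programming solver.

```python
def srp_one_step(x0, x_meas, delta, a_d, b_d, c_d, w_x=1.0, w_r=1.0):
    """One-step (h = 1) scalar SRP estimate of the rate r and the next state.

    Minimizes w_x*x1**2 + w_r*r**2 subject to x1 = a_d*x0 + b_d*r + c_d,
    x1 >= 0 and |x1 - x_meas| <= delta. Assumes b_d != 0.
    """
    k = a_d * x0 + c_d  # part of x1 that does not depend on r
    # Unconstrained minimizer of the quadratic objective in r.
    r_free = -w_x * b_d * k / (w_x * b_d**2 + w_r)
    # Admissible band for x1, then mapped to an interval for r.
    x_lo, x_hi = max(0.0, x_meas - delta), x_meas + delta
    if x_lo > x_hi:
        raise ValueError("measurement band lies entirely below zero")
    r_lo, r_hi = sorted(((x_lo - k) / b_d, (x_hi - k) / b_d))
    # Clipping is valid because the objective is convex in the single variable r.
    r_opt = min(max(r_free, r_lo), r_hi)
    return r_opt, k + b_d * r_opt

r_est, x_next = srp_one_step(x0=1.0, x_meas=1.2, delta=0.05,
                             a_d=0.9, b_d=0.2, c_d=0.0)
print(r_est, x_next)
```

In this example the estimator selects the smallest accumulation that is still
consistent with the measurement tolerance, which is the behavior the two
postulates are meant to produce.
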

The SRP estimates of all system variables allow for efficient determination of the 
parameter values by decoupling the full parameter estimation into multiple sets, 
each with fewer parameters. The kinetic parameters associated with a reaction rate 
are determined independently from the others using a Bayesian approach, known 
as maximum a posteriori estimation (Gunawan et al., 2003). In this formulation, 
the difference between the SRP rate estimate and that predicted by the rate 
equation (equation 11.18) is minimized. Further, the deviations of parameter values 
from those obtained in the previous iteration are penalized. The formulation is 
represented as: 


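\[ \min_{p} \; \left( \hat{r}^{i} - r^{i}(\hat{x}, p) \right)^T V_r^{-1} \left( \hat{r}^{i} - r^{i}(\hat{x}, p) \right) + (p - p^{\circ})^T V_p^{-1} (p - p^{\circ}), \quad i = 1, \ldots, N_R \tag{11.24} \]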


where $\hat{r}^{i}$ and $\hat{x}$ are the SRP estimates of the i-th reaction rate and the concentrations, 
respectively, $N_R$ represents the total number of reactions in the network, 
$p^{\circ}$ is the vector of parameter values obtained in the previous iteration, and $V_r$ 
and $V_p$ are the variances of the reaction rates and the parameters, respectively. 
The parameter variances are determined using equation 11.7, and the reaction rate 
variances are determined from the noise in the measurements from which the rates 
were estimated. The second term in the objective function of equation 11.24 is zero 
in the first iteration. 
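
For a rate law that is linear in a single parameter, the minimization in
equation 11.24 has a closed-form solution, which makes the role of the prior
term easy to see. The Python sketch below assumes a hypothetical first-order
rate law r = p x, scalar variances, and a sum over sampling times; the rate
law, the data, and the function name are illustrative assumptions and are not
the apoptosis model.

```python
def map_rate_constant(x_hat, r_hat, var_r, p_prev=None, var_p=None):
    """Maximum a posteriori estimate of p in the assumed rate law r = p*x.

    x_hat, r_hat: estimated concentrations and rates at the sampling times.
    p_prev, var_p: previous-iteration estimate and its variance; the prior
    term is omitted when p_prev is None (first iteration).
    """
    num = sum(x * r for x, r in zip(x_hat, r_hat)) / var_r
    den = sum(x * x for x in x_hat) / var_r
    if p_prev is not None:
        num += p_prev / var_p
        den += 1.0 / var_p
    return num / den

x_hat = [1.0, 2.0, 3.0]
# First iteration: no prior, rates consistent with p = 0.5.
p_first = map_rate_constant(x_hat, [0.5, 1.0, 1.5], var_r=0.01)
# Second iteration: new rate estimates suggest p = 0.6; the prior pulls back.
p_second = map_rate_constant(x_hat, [0.6, 1.2, 1.8], var_r=0.01,
                             p_prev=p_first, var_p=0.001)
print(p_first, p_second)
```

The second estimate lies between the previous value and the value favored by
the new data, weighted by the respective variances.
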

An important step in the iterative approach is the model refinement method. In 
this work, the model refinement is achieved by an optimization of the experiment 
protocol or an optimal selection of the measurement set. The optimal experiment 
design maximizes the number of identifiable parameters, which is determined using 
the orthogonal procedure of McAuley and colleagues (Yao et al., 2003). When there 
exist multiple experiment designs with the same number of identifiable parameters, 
the selection is done to maximize the data informativeness by maximizing the D- 
optimality criterion. Mathematically, the optimal experiment design determination 
is given by: 


\[ \max_{e \in E} \; \det(\mathrm{FIM}) \quad \text{s.t.} \quad N_I = \max_{e \in E} N_I \tag{11.25} \]

where E denotes the parameterized space of experiment protocols e and $N_I$ denotes 
the number of identifiable parameters. 

Figure 11.10 Model predictions of a few of the concentrations and reaction rates of 
the apoptosis model. Reaction 1 involves the binding of the FADD to the FAS-FASL 
complex; reaction 2 involves the activation of executioner procaspase by caspase-8. Solid 
line: real system; dashed line: prediction with estimated parameters after first iteration 
(suboptimal experiment with suboptimal measurements); dash-dotted line: prediction 
with estimated parameters after second iteration (suboptimal experiment with optimal 
measurements); dotted line: prediction with estimated parameters after second iteration 
(optimal experiment with suboptimal measurements). 


Figure 11.10 presents the time profiles of a few species concentrations and reaction 
rates predicted by the models identified by the iterative framework, which show the 
improvements in model predictions with each iteration. As model identification is 
closely related to parameter identifiability, the improvement is larger when 
using the optimal experiment design than when using the optimal measurement set 
(Gadkar et al., 2005a). 

This case study demonstrates an iterative framework for model identification 
to study quantitatively the dynamics of cellular pathways. It shows tremendous 
potential in improving the predictive capabilities for biological systems, especially 
in cases where experimental data are available but the kinetic parameters involved 
in the pathway reactions are unknown. 





11.5 Summary and Future Directions 


The reverse engineering of cellular networks represents a crucial aspect of a systemic 
approach for biological discovery in the post-genomic era. The major hurdle in 
this task is the high complexity of cellular networks, implying models with large 
numbers of nodes and interactions. Fortunately, the cellular networks appear to 
have structures (that is, they are not random) that are shared with engineered 
systems (see chapter 3). This may be one reason why the application of 
engineering methods, in particular systems identification, has proven fruitful 
in approaching these problems. 

In brief, the challenges in the cellular network inference condense to the inte- 
gration between experimental and modeling efforts. Advances in molecular biology 
have allowed high throughput measurements of the interactions, but the data may 
not carry sufficient information to (uniquely) identify the network interactions, as 
shown in one of the examples above. In addition, the large size of network mod- 
els renders the inference problem practically intractable. The case studies in the 
chapter demonstrate attempts to solve these problems by building an estimator for 
the network and formulating an iterative model identification framework. Here, the 
experiments and models are coupled together through a model-based experiment 
design and a Bayesian approach to incorporate prior knowledge of the network. 
Such engineering methods and other approaches from engineering, computer 
science, and statistics have found great success in their domains and will likely find 
greater application in systems biology as experimental methods are refined and 
closer collaborations are developed between modelers and experimentalists. 

There still exist many open research problems in the reverse engineering of cellular 
networks. On the experimental front, the challenges are: (i) to improve the signal- 
to-noise ratio in the measurements, (ii) to develop new tools for measuring the 
cellular concentrations, fluxes, and interactions in both space and time, and (iii) 
to incorporate model-based design of experiment protocol. All of these will allow 
efficient and accurate dynamical modeling of the networks. The efforts here can 
benefit from existing models to identify the most useful type of measurements, 
for example, using information from sensitivity analysis. As experimental data will 
come from different measurements, data preprocessing may become necessary to 
extract relevant information before the inference step. 

On the modeling front, the main challenge still remains in the formulation of 
model structures that can exploit the characteristics of cellular networks: sparsity, 
hierarchy, robustness, and optimality. Decomposition methodologies can exploit 
these characteristics and reduce the network inference problem to a reasonable 
scope. Categorically, these methods fall into either horizontal or vertical decompo- 
sition. The horizontal approaches focus on the topology of the network by dividing 
the network into building blocks (such as the aforementioned motifs and modules). 
The vertical approaches decompose the network based on the time scale of intercon- 
nections (that is, dynamics). There exists a need to integrate the two approaches 
in systems biology to obtain integrated system models. As noted above, the goal 
is to strike a balance between the size and level of detail, that is, a model struc- 
ture that can sufficiently capture the dynamical behavior of a cellular network and 
is also amenable to numerical simulation and analysis in model identification. 
There may be no universal model for all cellular systems and purposes, but rather 
a tailored model structure for each system and use. Again, the modeling research 
should be integrated with the experimental efforts such that advances in each area 
will improve the other. 

Much work has been devoted to determining the responses of biochemical networks 
to changes in their environment or their internal components. These studies have 
been motivated both by direct application to metabolic engineering and pharma- 
ceutics as well as by the desire to improve our understanding of the behavior of 
these systems. This sensitivity analysis has focused primarily on the steady state 
(asymptotic) response of a system to constant (step) changes in parameters; see 
also chapter 1. However, there are cases in which a dynamic analysis of system 
response is crucial. This is clearly the case for mechanisms whose nominal behavior 
is time-varying, for example, the cell cycle. Moreover, investigations of the transient 
behavior invoked in signal transduction networks or the role of $Ca^{2+}$ oscillations as 
a second messenger demand a dynamic analysis. This chapter presents a framework 
which is ideally suited to analysis of dynamic systems. Tools from control theory 
can be applied to elucidate the functioning of self-regulating (homeostatic) systems 
and to predict the effect of perturbations. 





12.1 Linear Systems and the Frequency Response 


We begin with an introduction to the framework of linear systems and one of the 
primary tools for describing their behavior: the frequency response. These ideas can 
be seen as a natural extension of a standard approach to analysis of biochemical 
systems: parametric sensitivity analysis. 

Analytic tools for the study of the sensitivity of biochemical systems have 
been developed within the fields of Metabolic Control Analysis (MCA) (Kacser 
and Burns, 1973; Fell, 1992; Hofmeyr, 2000) and Biochemical Systems Theory 
(BST) (Savageau, 1976; Voit, 2000). This analysis is carried out in a linear (or 
log-linear) regime in which only small perturbations are addressed. This restriction 
is necessary since it is only after linearization that the analysis becomes tractable. 

The same approach is taken here: the linearized response of a biochemical system 
is considered. The sensitivity analysis is extended by considering the response not 
just to constant parameter changes, but also to time-varying perturbations. This 
is achieved through a frequency domain analysis that describes the response of 
the system to a canonical set of inputs (sinusoids). The response to arbitrary 
perturbations can be reconstructed by use of the Fourier transform. This analysis 
can be interpreted as an extension of MCA as presented by Ingalls (2004). 


12.1.1 Linear Input/Output Models of Biochemical Networks 


A network consisting of n chemical species involved in m reactions is modeled. 
The n-vector s is composed of the concentrations of each species. The r-vector p 
is composed of the (external) parameters of interest in the model. The m-vector 
valued function v = v(s,p) describes the rate of each reaction as a function of 
species concentrations and parameter values. Finally the n by m stoichiometry 
matrix N describes the network: component $N_{ij}$ is equal to the net number of 
individuals of species i produced or consumed in reaction j; see chapter 5. The 
network can then be modeled by the ordinary differential equation 


$$\frac{d}{dt} s(t) = N\, v(s(t), p(t)) \quad \text{for all } t \geq 0 \tag{12.1}$$

The vector p contains any external parameters which have a direct effect on the 
rates of the reactions (including, for example, kinetic constants of enzymes and 
external effectors). 

For the purposes of this presentation, we will assume that the species concen- 
trations are not constrained by any structural conservations (as when there are 
conserved moieties), and so the matrix N has full row rank. For a treatment of the 
general case, see (Ingalls, 2004). 

Local analysis of system 12.1 will be carried out in the neighborhood of a steady 
state $(s^0, p^0)$ of interest. This point is brought to the origin by a change of variables 
in the states: $x(t) = s(t) - s^0$, and in the parameters: $u(t) = p(t) - p^0$. The n-vector 
x and the r-vector u indicate the deviation from the nominal state and parameter 
values of system 12.1, respectively. The linearized system then takes the form 


$$\dot{x}(t) = \left[ N \frac{\partial v}{\partial s} \right] x(t) + \left[ N \frac{\partial v}{\partial p} \right] u(t) \tag{12.2}$$

where the derivatives are taken at $(s^0, p^0)$. By construction, this linearized system 
has steady state (x,u) = (0,0). 

The behavior of the original system 12.1 is approximated by that of the linearized 
system 12.2 near the nominal operating point. In particular, the linearized model 
faithfully represents the response of the original system to small changes in the 
parameters (for which the function $u(\cdot)$ remains near zero). Standard sensitivity 
analysis involves gauging the response of system 12.2 to constant (step) changes in 
the parameter levels. In extending this analysis to nonconstant perturbations, it is 
useful to introduce the notations used in systems and control theory for analyzing 
such systems. 

The standard model of a linear time-invariant input-output system has the form 


$$\dot{x}(t) = A x(t) + B u(t) \tag{12.3}$$
$$y(t) = C x(t) + D u(t) \tag{12.4}$$


where x is an n-vector, u is an r-vector, y is a q-vector, and A, B, C, and D are 
matrices of appropriate dimensions. The dynamics of the linearized model 12.2 take 
this form with 


$$A = N \frac{\partial v}{\partial s} \quad \text{and} \quad B = N \frac{\partial v}{\partial p} \tag{12.5}$$
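
The construction of A and B in equation 12.5 can be sketched numerically. The example below is only an illustration under stated assumptions: the three-reaction chain, its mass-action rate laws, the parameter values, and the helper names `rates` and `jacobian` are hypothetical and do not come from the chapter. The derivatives are approximated by central finite differences at the nominal steady state.

```python
import numpy as np

# Hypothetical toy network (not from the chapter):
#   reaction 1:      -> S1   rate v1 = p1
#   reaction 2:  S1  -> S2   rate v2 = p2 * s1
#   reaction 3:  S2  ->      rate v3 = p3 * s2
N = np.array([[1.0, -1.0, 0.0],
              [0.0, 1.0, -1.0]])  # n = 2 species, m = 3 reactions


def rates(s, p):
    """Reaction rate vector v(s, p) for the toy network."""
    return np.array([p[0], p[1] * s[0], p[2] * s[1]])


def jacobian(f, z0, h=1e-6):
    """Central-difference Jacobian of f evaluated at z0."""
    z0 = np.asarray(z0, dtype=float)
    cols = []
    for j in range(z0.size):
        dz = np.zeros_like(z0)
        dz[j] = h
        cols.append((f(z0 + dz) - f(z0 - dz)) / (2.0 * h))
    return np.column_stack(cols)


p0 = np.array([1.0, 2.0, 4.0])                  # assumed nominal parameters
s0 = np.array([p0[0] / p0[1], p0[0] / p0[2]])   # steady state: N v(s0, p0) = 0

dv_ds = jacobian(lambda s: rates(s, p0), s0)    # m x n matrix of dv/ds
dv_dp = jacobian(lambda p: rates(s0, p), p0)    # m x r matrix of dv/dp

A = N @ dv_ds   # n x n
B = N @ dv_dp   # n x r

print("A =\n", A)
print("B =\n", B)
```

For rate laws with known analytic derivatives, the finite-difference step can be replaced by the exact Jacobians; the products with N are unchanged.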


The components of the input vector u can play a number of roles in the system. 
In control engineering, three of the most common are: reference input (providing an 
external signal to which the system should respond), control input (by which one 
subnetwork might regulate the activity of another), and disturbance (to incorporate 
the effect of perturbations). 

The vector y is referred to as the system output and represents a function of 
the state and input which is of specific interest. In addressing biochemical systems, 
there are several outputs which may be of interest, including species concentrations, 
reaction rates, pathway fluxes, transient times, and rates of entropy production 
(cf. section 5.8.1 of (Heinrich and Schuster, 1996)). In what follows, two output 
vectors of primary interest will be addressed. 

The first is the vector of independent species concentrations, or more precisely, 
the deviations of these concentrations from the nominal level. In the linearized 
model 12.2, these deviations are described by the state x. This choice of output is 
thus characterized by 


y(t) = x(t) (12.6) 


which corresponds to the choice C = I (the n × n identity matrix) and D = 0. 

The second output of interest is the vector of reaction rates. Again, it is the 
deviation from the nominal rates which is the natural choice for y. This is 
approximated by the linearization of the reaction rate function $v(\cdot,\cdot)$ at the nominal point 
as follows: 


$$y(t) = \frac{\partial v}{\partial s}\, x(t) + \frac{\partial v}{\partial p}\, u(t) \tag{12.7}$$


where the derivatives are evaluated at $(s^0, p^0)$. This output takes the form of 
equation 12.4 with $C = \frac{\partial v}{\partial s}$ and $D = \frac{\partial v}{\partial p}$. 


12.1.2 Frequency Response 


Sensitivity analysis is concerned with determining the steady state response of 
a system to constant disturbances, for example, to an instantaneous change in 
the activity of an enzyme from one constant level to another. Extending that 
analysis to determination of the asymptotic response to arbitrary time-varying 
perturbations may seem a daunting task. Indeed, this is an intractable problem 
in general. However, when restricting to linear systems, a satisfactory result can be 
achieved. 

There are two features of linear systems that can be exploited in this analysis. 
The first is simply the linear nature of their input-output behavior which implies an 
additive property: provided the system starts with initial condition x(0) = 0 (which 
corresponds to the nominal steady state of the biochemical network), the output 
produced by the sum of two inputs is the sum of the outputs produced independently 
by the two inputs. That is, if input $u_1(\cdot)$ elicits output $y_1(\cdot)$ and input $u_2(\cdot)$ yields 
output $y_2(\cdot)$, then input $u_1(\cdot) + u_2(\cdot)$ leads to output $y_1(\cdot) + y_2(\cdot)$. 

The additive property allows a reductionist approach to the analysis of system 
response: if a complicated input can be written as a sum of simpler signals, the 
response to each of these simpler inputs can be addressed separately and the 
original response can be found through a straightforward summation. This leads to 
a satisfactory procedure provided one is able to find a family of “simple” functions 
with the following two properties: 1) the family has to be “complete” in the sense 
that an arbitrary signal can be decomposed into a sum of functions chosen from 
this family; and 2) it must enjoy the property that the asymptotic response of a 
linear system to inputs chosen from the family is easily characterized. The family 
of sinusoids (sines and cosines) satisfies both of these conditions. 

The decomposition of a signal f(t) into a combination of sinusoids is the 
foundation of Fourier analysis (Strang, 1986; Lynn, 1982), which allows the description of 
f(t) in terms of its Fourier transform $F(\omega)$, defined as a function of frequency $\omega$ by 


$$F(\omega) = \int_{-\infty}^{\infty} f(t)\, e^{-i\omega t}\, dt. \tag{12.8}$$


The transform provides a record of the frequency content of f(t) and is an 
alternative characterization of the original function. While complete recovery of 
a signal from its transform is difficult to achieve, important aspects of the nature of 
the signal can be gleaned directly from the graph of the transform. In particular, one 
can determine what sort of variations dominate the signal (low frequency or high 
frequency) by comparing the content at various frequencies. Quickly-varying signals 
have transforms with most of their content at high frequencies, while slowly-varying 
functions show primarily low-frequency content. 

The second crucial property of linear systems that will be used is that, as 
mentioned above, their response to sinusoidal inputs can be easily described. Indeed, 
it is this property of sines and cosines which makes Fourier analysis a useful tool 
for analyzing linear time-invariant systems. 

Consider the case of systems for which the input and output are scalars, referred 
to as Single-Input-Single-Output (SISO) systems. For such systems, a sinusoidal 
input of frequency $\omega$, for example, $u(t) = \sin(\omega t)$, generates an output which is, 
after an initial transient, a sinusoid of the same frequency: $y(t) = A \sin(\omega t + \phi)$. 
This response can be described by two numbers: A, the amplitude of the oscillatory 
output, known as the system gain; and $\phi$, the phase of the oscillatory output, 
known as the phase shift. For systems which are not SISO, there is one such pair of 
numbers which characterizes the response of each output channel (or component) 
to each input channel. The particular gain and phase shift which correspond to each 
frequency $\omega$ can be conveniently described by the assignment of a single complex 
number $Ae^{i\phi}$ (with modulus A and argument $\phi$) to each frequency. The resulting 
complex-valued function is called the frequency response. The frequency response 
for system 12.3 is 


$$H(i\omega) = C (i\omega I - A)^{-1} B + D, \quad \text{for all real } \omega \tag{12.9}$$


This function will in general be matrix-valued but is scalar-valued in the SISO case. 
The frequency response can be derived through an algebraic calculation involving 
the Laplace transform of the system (Morris, 2001). The Laplace transform is 
a standard tool in the analysis of linear systems. It allows a linear differential 
equation, stated in the time domain, to be restated as an algebraic equation, 
in the Laplace domain — the complex plane. The behavior of the system in the 
Laplace domain is characterized by its transfer function, which is recovered from 
equation 12.9 by replacing the purely imaginary argument $i\omega$ with a general complex 
variable s. 
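
Equation 12.9 can be evaluated directly with a linear solve at each frequency. The sketch below is a minimal illustration, not the authors' code: the function names are hypothetical, and the scalar test system (dx/dt = -x + u, y = x) is an assumed example chosen because its frequency response, 1/(1 + iω), is known in closed form.

```python
import numpy as np


def frequency_response(A, B, C, D, omega):
    """Evaluate H(i*omega) = C (i*omega*I - A)^{-1} B + D for one real omega."""
    n = A.shape[0]
    return C @ np.linalg.solve(1j * omega * np.eye(n) - A, B) + D


def gain_db_and_phase_deg(h):
    """Gain in decibels and phase shift in degrees of a complex number h."""
    return 20.0 * np.log10(np.abs(h)), np.degrees(np.angle(h))


# Assumed scalar example: dx/dt = -x + u, y = x, so H(i*omega) = 1 / (1 + i*omega).
A = np.array([[-1.0]])
B = np.array([[1.0]])
C = np.array([[1.0]])
D = np.array([[0.0]])

for omega in (0.0, 1.0, 10.0):
    H = frequency_response(A, B, C, D, omega)
    gain_db, phase = gain_db_and_phase_deg(H[0, 0])
    print(f"omega = {omega:5.1f}: gain = {gain_db:7.2f} dB, phase = {phase:7.2f} deg")
```

Using a linear solve rather than forming the matrix inverse explicitly is a common numerical choice. For a network model, A and B would be taken from equation 12.5 and C, D from the chosen output.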

In addressing biochemical networks, system response can be described as in 
equation 12.9. Recall, the matrices A and B describing the dynamics were derived 
in equation 12.5. If the independent species concentrations are chosen as output we 
have (from equation 12.6) C = I and D = 0, and so the frequency response takes 
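the form 


$$H_s(i\omega) = \left( i\omega I - N \frac{\partial v}{\partial s} \right)^{-1} N \frac{\partial v}{\partial p} \tag{12.10}$$

Alternatively, for the reaction rate output, equation 12.7 gives $C = \frac{\partial v}{\partial s}$ and $D = \frac{\partial v}{\partial p}$, 
so that 

$$H_v(i\omega) = \frac{\partial v}{\partial s} \left( i\omega I - N \frac{\partial v}{\partial s} \right)^{-1} N \frac{\partial v}{\partial p} + \frac{\partial v}{\partial p} \tag{12.11}$$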


Each element of these matrix-valued frequency responses is a scalar-valued func- 
tion which describes the response of one output channel to one input channel. For 
each such input/output channel pair, the complex-valued function which describes 
the system behavior can be plotted in a number of ways. Perhaps the most useful 
of these visualizations is the Bode plot, in which the magnitude and argument of 
the frequency response are plotted separately. The magnitude of the function value 
(the system gain) is plotted on a log-log scale, where the gain is measured in 
decibels (dB) (defined by r = $20 \log_{10} r$ dB; note that 0 dB corresponds to a gain of 
one). The argument of the function value (the phase shift) appears on a semi-log 
plot, with log frequency plotted against phase in degrees. Bode plots will be used 
to illustrate frequency responses in the remainder of this chapter. 

The response of a system to a constant input (which can be thought of as a 
sinusoid with frequency zero) is characterized by the frequency response at $\omega = 0$. 
Making this substitution into equations 12.10 and 12.11 the response of the system 
is found as 


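$$H_s(0) = -\left( N \frac{\partial v}{\partial s} \right)^{-1} N \frac{\partial v}{\partial p} \quad \text{and} \quad H_v(0) = -\frac{\partial v}{\partial s} \left( N \frac{\partial v}{\partial s} \right)^{-1} N \frac{\partial v}{\partial p} + \frac{\partial v}{\partial p} \tag{12.12}$$

These expressions can be derived from a standard sensitivity analysis of the system, 
such as that provided by MCA. 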
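
As a check on equation 12.12, the zero-frequency response can also be obtained directly from the linearized model 12.2 by holding u constant and setting the time derivative to zero. The short derivation below assumes that the matrix N ∂v/∂s is invertible (as it is at an asymptotically stable steady state when N has full row rank); the labels H_s and H_v follow equations 12.10 and 12.11.

```latex
\begin{align*}
0 &= N \frac{\partial v}{\partial s}\, x + N \frac{\partial v}{\partial p}\, u
\;\Longrightarrow\;
x = -\left( N \frac{\partial v}{\partial s} \right)^{-1} N \frac{\partial v}{\partial p}\, u = H_s(0)\, u, \\
y &= \frac{\partial v}{\partial s}\, x + \frac{\partial v}{\partial p}\, u
= \left[ -\frac{\partial v}{\partial s} \left( N \frac{\partial v}{\partial s} \right)^{-1} N \frac{\partial v}{\partial p} + \frac{\partial v}{\partial p} \right] u = H_v(0)\, u .
\end{align*}
```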





12.1.3 Illustration of the Frequency Response 


The effect of negative feedback will be illustrated by an analysis of the bacterial 
trp operon, which is responsible for tryptophan production. A number of models of 
bacterial tryptophan biosynthesis have appeared in the literature, originating with 
the work of Goodwin (1965). The model of Xiu et al. (1997) will be considered here. 
A more complete model, including explicit time delays, has also appeared (Santillán 
and Mackey, 2001). 

The Xiu model involves three state variables: the concentration of tryptophan P, 
the concentration of mRNA transcribed from the trp operon M, and the amount of 
expressed enzyme E. (It is an abstraction of the model that tryptophan synthesis 
is catalyzed by a single enzyme.) The dynamics of the model describe production 
of mRNA, enzyme, and tryptophan, as well as the degradation and dilution (due 
to cell growth) of each of these species. Cellular consumption of tryptophan is 
also included. In addition, two negative feedbacks are incorporated. The first is the 
inhibition of enzyme E by tryptophan. The second is the repression of transcription 
of mRNA, also tryptophan dependent. This genetic regulation is achieved through 
the activity of a repressor molecule R which, when bound to two units of tryptophan, 
interacts with an operator region of the operon, thus blocking transcription. 

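The dynamics, indicated in figure 12.1, can be described by the equations 

$$\frac{dx}{dt} = \frac{z + 1}{1 + (1 + r) z} - (\alpha_1 + \mu)\, x \tag{12.13}$$
$$\frac{dy}{dt} = x - (\alpha_2 + \mu)\, y \tag{12.14}$$
$$\frac{dz}{dt} = y\, \frac{k_i^2}{k_i^2 + z^2} - (\alpha_3 + \mu)\, z - \alpha_4 \frac{z}{z + 1} - \alpha_5 (1 + \alpha_6 \mu)\, \mu\, \frac{z}{z + k} \tag{12.15}$$

$\alpha_1 = 0.9$, $\alpha_2 = 0.02$, $\alpha_3 = 0$, $\alpha_4 = 0.024$, $\alpha_5 = 430$, 
$\alpha_6 = -7.5$, $\mu = 0.00936$, $k_i = 2283$, $k = 0.05$. 
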
Figure 12.1 Tryptophan biosynthesis: reaction scheme. 





where x, y, and z are dimensionless concentrations of mRNA, enzyme, and trypto- 
phan respectively. The dimensionless parameters are described by Xiu et al. (1997). 
The behavior of the system under changes in the value of $\alpha_5$ will be addressed, with 
a nominal value of $\alpha_5 = 430$. The effect of the enzyme inhibition on this response 
will be illustrated by considering two values of the parameter r: strong feedback is 
exhibited with r = 10, while weaker feedback will be addressed by taking r = 5. 
The concentration of tryptophan (z) is taken as the output of the system. The 
magnitude frequency responses to changes in $\alpha_5$ are shown in figure 12.2. 

Figure 12.2 Frequency response for trp operon model. 


In this model, $\alpha_5$ describes the effect of cellular demand for tryptophan (Xiu et al., 
1997). The behavior shown in the figure is typical of a negative feedback system. 
With weak feedback (r = 5), the effect of the input on asymptotic tryptophan 
levels decreases monotonically as the frequency grows larger. Strengthening the 
feedback (to r = 10) has two effects. The first is that the low frequency response 
is improved: as a standard sensitivity analysis would show, increasing the feedback 
reduces the effect of perturbations on the output. The other feature of stronger 
negative feedback is an increase in sensitivity at higher frequencies — to the point 
that the feedback actually makes the system more sensitive to disturbances over a 
certain frequency range. 

The knowledge that negative feedback can introduce such resonance effects is 
crucial to the design of feedback systems. The trade-off between improved response 
at low frequencies and increased sensitivity at higher frequencies can be made 
explicit (for certain linear systems) by a constraint known as Bode’s integral 
formula (Bode, 1945). System designers work around this “performance constraint” 
by implementing feedback that introduces increased sensitivity only at frequency 
ranges over which the system is unlikely to be excited. One could postulate that 
the same is true of feedback mechanisms within the cell: they have been crafted by 
natural selection in such a way that a trade-off is made between improved response 
to common low-frequency inputs and amplification of rare disturbances at higher 
frequencies. 

Having illustrated the effect of negative feedback on the frequency response, we 
now turn to a more complete description of feedback strategies, highlighting the 
critical role of integral feedback. 





12.2 Integral Feedback Control: From Homeostasis to Chemotaxis 


Homeostasis is the dynamic self-regulation of a system to maintain essential vari- 
ables within limits necessary for acceptable performance in the presence of unex- 
pected disturbances. It is one of the defining features of living organisms. Home- 
ostasis is achieved through countless control systems that regulate the multiplicity 
of biological processes. This intricate control network ensures robustness in the 
constantly changing real world; see chapter 2. 

A related phenomenon is that of sensory adaptation in which the sensory system 
adjusts itself to changing environmental conditions for peak performance. For 
example, one’s vision can adapt to the ambient background light intensity (bright 
or dim) so that there is sufficient contrast to detect objects. In signal transduction 
pathways, negative feedback regulation causes the output to return toward its 
prestimulus value after the application of a step increase in the input. During 
movement toward a chemical signal (chemotaxis), this type of adaptation facilitates 
the sensing of chemical gradients over a wide range of concentrations. 

In this section we will discuss how one particular type of control system, integral 
feedback control, plays a crucial role in both homeostasis and sensory adaptation. 


12.2.1 Negative Feedback Control 


The most fundamental control system is negative feedback control. In such a system, 
the controller measures the difference between the current output and the desired 
output and based on this error takes some control action that reduces the error (see 
figure 12.3A). Negative feedback promotes regulation around a set point, stability, 
and robustness when performing some task. An important goal of the controller is 
to minimize the effect of the disturbances $d_1$ and $d_2$ on the output y. As shown 
in section 12.3, positive feedback increases the difference between the current and 
previous output, and thereby acts as an amplifier, but with potentially destabilizing 
consequences. 







Figure 12.3 A) Block diagram of typical control system. The system to be controlled, 
which is usually referred to in control engineering as the “plant,” P, takes an input $u_1$ 
and converts it into an output y, which is typically normalized to represent the deviation 
between the current output and the desired output (that is, the error). The disturbances 
$d_1$ and $d_2$ perturb the input and output, respectively. The controller measures the error 
and takes an appropriate control action to reduce y. B) Block diagram of integral feedback 
control system. The input is u, the gain of the plant is k, and the controller is an integrator. 
From this diagram, it is clear that the feedback term $x = \int y\, dt$. Thus, dx/dt = y and at 
steady state, $y \to 0$ as long as the system is stable. 


One can classify controllers according to the mathematical operations used 
to convert the error signal into a control action. In today’s world of ultrafast 
computers, one can design fancy digital controllers that implement arbitrarily 
complex strategies. In the past, however, control engineers resorted to three basic 
types of feedback control: (1) proportional control: the error term is multiplied 
by a constant before being fed back; (2) integral control: error is integrated; or 
(3) derivative control: error is differentiated. Each type of feedback has beneficial 
features. Proportional control corrects for “current” errors. One can adjust the 
amount of feedback by increasing or decreasing the constant factor. Higher feedback 
gain is better at rejecting disturbances, but it also causes the system to become less 
stable. Integral control eliminates steady state errors. Finally, derivative control 
provides “anticipation” of upcoming changes, which increases damping, improves 
stability, and decreases transient errors. Mathematically, we can represent these 
controllers as in table 12.1. 

Type          Controller (time domain)        Controller (Laplace domain)    Closed-loop transfer function 
Proportional  $u_c(t) = k_P\, y(t)$           $U_c(s) = k_P\, Y(s)$          $P(s) / (1 + k_P P(s))$ 
Integral      $u_c(t) = k_I \int y(t)\, dt$   $U_c(s) = (k_I / s)\, Y(s)$    $s P(s) / (s + k_I P(s))$ 
Derivative    $u_c(t) = k_D\, dy(t)/dt$       $U_c(s) = s k_D\, Y(s)$        $P(s) / (1 + s k_D P(s))$ 

Table 12.1 Controller types with time- and frequency-domain descriptions. The last 
column shows the transfer function from $D_1(s)$ to $Y(s)$. 

By substituting these controllers into the feedback system shown in figure 12.3A, 
one can calculate the relationship (transfer function) between the disturbance inputs 
and the output using the block diagram and some simple algebra. The transfer 
function from $d_1$ to y, found in table 12.1, represents the sensitivity of the system 
to the disturbance. 

Figure 12.4 A) Time course of response of three control systems to a unit step 
disturbance ($D_1(s) = 1/s$, $k_P = k_I = k_D = 1$). Proportional (solid), integral (dashed), and 
derivative (dotted) control systems are depicted. B) Bode plot of the sensitivity function 
$Y(s)/D_1(s)$. Recall that 0 dB corresponds to an output that has the same magnitude as 
the input; that is, a system with unity gain. 


From these transfer functions one can run simulations of the output responses to 
an input disturbance signal and compare the three controllers. Applying a unit 
step disturbance at $d_1(t)$ ($D_1(s) = 1/s$) produces the time histories shown in 
figure 12.4A. Proportional control attenuates the disturbance at both short and 
long time scales; integral control completely neutralizes step disturbances at steady 
state but has little effect early on; derivative control is the opposite, blocking 
the immediate change but showing no attenuation at steady state. An alternative 
representation of these dynamics in the frequency domain is possible by taking 
the Fourier transform of the time domain signals. This frequency response of the 
transfer function or Bode plot offers perhaps a simpler depiction of the above 
comparison: one readily observes the disturbance attenuation at low frequencies 
by integral control, attenuation at high frequencies by derivative control, and 
attenuation at all frequencies by proportional control. 
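
The frequency-domain comparison can be reproduced with a few lines of code. The sketch below evaluates the sensitivity S(s) = P(s)/(1 + P(s)C(s)) for the three controller types with all gains set to one. The plant is an assumption: a unit static gain, P(s) = 1, is used here because it gives the qualitative behavior described above; the plant behind figure 12.4 is not specified in the text. The function and variable names are hypothetical.

```python
import numpy as np


def sensitivity(plant, controller, omega):
    """Magnitude |S(i*omega)| with S(s) = P(s) / (1 + P(s) C(s))."""
    s = 1j * np.asarray(omega, dtype=float)
    P = plant(s)
    return np.abs(P / (1.0 + P * controller(s)))


plant = lambda s: np.ones_like(s)   # assumed unit static gain, P(s) = 1
controllers = {
    "proportional": lambda s: np.ones_like(s),  # C(s) = k_P,     k_P = 1
    "integral": lambda s: 1.0 / s,              # C(s) = k_I / s, k_I = 1
    "derivative": lambda s: s,                  # C(s) = k_D s,   k_D = 1
}

# Low, intermediate, and high frequency; omega = 0 is avoided because 1/s is singular there.
omegas = np.array([0.01, 1.0, 100.0])
gains_db = {name: 20.0 * np.log10(sensitivity(plant, c, omegas))
            for name, c in controllers.items()}

for name, g in gains_db.items():
    print(f"{name:>12}: " + ", ".join(f"{x:7.2f} dB" for x in g))
```

Under these assumptions, the proportional controller gives the same attenuation at every frequency, the integral controller attenuates only the low-frequency end, and the derivative controller only the high-frequency end.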

To capture the properties of all three controllers, one can combine them into 
a proportional-integral-derivative (PID) control system. The transfer function for 
a PID controller is written as $PID(s) = k_P + k_I/s + k_D s$. One can obtain the 
desired performance by tuning the parameters $k_P$, $k_I$, and $k_D$ to obtain the best 
balance of steady state error, transient behavior, and stability. From a Bode plot 
perspective, manipulating the three coefficients allows one to shape the Bode plot 
to obtain the optimal disturbance attenuation at the critical frequencies. Thinking 
in terms of transfer functions, the PID controller offers sufficient flexibility to place 
the dominant pole(s) of the system at appropriate location(s) for the desired 
system behavior. 

Although the above analysis, based on transfer functions, is for linear feedback 
systems, the general lessons also apply to nonlinear systems. Indeed, it is helpful 
to think about complex digital controllers in these simpler terms in order to gain 
intuition. In addition, PID controllers are still widely used for systems possessing 
slower dynamics, such as in process control. 


12.2.2 Primer on Integral Control 


Now we will focus our attention on integral feedback control because of the 
remarkable property of perfect regulation at steady state against step disturbances. 
More importantly, this regulation is robust to internal and external perturbations. 
For example, the presence of an additional disturbance $d_2$ does not affect the perfect 
regulation of the step disturbance $d_1$. Likewise, the steady state output is robust 
to variations in the parameters of the plant. Thus, integral control ensures the 
robust tracking of a specific steady state value so that the error approaches zero 
despite uncertainty in internal and external conditions. Exceptions arise in the case 
of higher-order unstable disturbance inputs (for example, ramp inputs) or when the 
controller itself is perturbed. 

Integral controllers are ubiquitous in man-made systems. For example, the cruise 
control in a car uses integral control to maintain robustly the speed of the vehicle 
at the set point despite disturbances such as the wind or a hill. In an airplane, 
integral control loops are found at every level from CPUs to instruments to the 
entire vehicle. A single oil refinery possesses more than 10,000 integral feedback 
loops. 

A block diagram of a simple linear system with integral feedback illustrates its 
chief features; see figure 12.3B. The plant or network, represented by the block with 
gain k (P(s) = k), takes the input u and produces the output $y_1$. The difference 
between the output $y_1$ and the desired steady state output $y_0$ is the error term y. 
Then, y is integrated and fed back into the system. The key to integral control is 
that the feedback term $x = \int y\, dt$ so that 


$$\frac{dx}{dt} = y \tag{12.16}$$

At steady state the time derivatives of the variables go to 0, so that $y \to 0$ as 
$t \to \infty$ independent of the values of the input u and the gain k. Hence, the error 
asymptotically approaches zero as long as the system is stable. It is important to 
note that this analysis does not depend on the fact that the system is linear; the 
perfect regulation property of integral feedback also applies to nonlinear systems. 
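
A short simulation illustrates that the steady state error does not depend on the input or the plant gain. The sketch below assumes one particular reading of the block diagram in figure 12.3B: the integrated error x is subtracted from the input, so that y_1 = k(u - x) and y = y_1 - y_0. That loop structure, the forward-Euler scheme, and all numerical values are assumptions made for illustration.

```python
def final_error(k, u, y0=1.0, dt=0.01, t_end=60.0):
    """Forward-Euler simulation of dx/dt = y with y = k*(u - x) - y0.

    Returns the error y at time t_end, starting from x(0) = 0.
    """
    x = 0.0
    for _ in range(int(t_end / dt)):
        y = k * (u - x) - y0   # error: plant output minus desired output
        x += dt * y            # integrator state accumulates the error
    return k * (u - x) - y0


# The error decays to zero for every (assumed) gain and input level tried here.
for k in (0.5, 2.0, 5.0):
    for u in (1.0, 5.0):
        print(f"k = {k}, u = {u}: final error y = {final_error(k, u):.2e}")
```

The integrator state x settles at a value that depends on k and u, while the error itself returns to zero in each case, provided k > 0 so that the loop is stable.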

Transfer function and state-space interpretations. As we described above, 
for a typical linear feedback system with plant P(s) and controller C(s), the 
sensitivity transfer function from the disturbance input $D_1(s)$ to the output is 
S(s) = P(s)/(1 + P(s)C(s)). We can then prove that if the input signal is a step 
of size A, then the output will approach zero asymptotically in time if and only if 
the sensitivity function has a zero at the origin (Doyle et al., 1992). 

In transfer function form, a step input of size A is described by $U_1(s) = A/s$. 
This leads to the output $Y(s) = U_1(s)S(s) = AS(s)/s$. If the feedback system is 
stable, then by the final value theorem (Doyle et al., 1992), $y(t) \to AS(0)$ as $t \to \infty$. 
Clearly, the right-hand side is zero if and only if S(0) = 0. An integral feedback 
system possesses such a zero at the origin: S(s) = sP(s)/(s + P(s)). 
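
The two steps of this argument can be written out explicitly. The derivation below takes the integral controller with unit gain, C(s) = 1/s, and assumes that the closed loop is stable and that P(0) is finite and nonzero, so that the limit is well defined.

```latex
\begin{align*}
C(s) = \frac{1}{s}
\;\Longrightarrow\;
S(s) &= \frac{P(s)}{1 + P(s)/s} = \frac{s\,P(s)}{s + P(s)}, \\
\lim_{t \to \infty} y(t) &= \lim_{s \to 0} s\,Y(s)
= \lim_{s \to 0} s \cdot \frac{A\,S(s)}{s} = A\,S(0) = 0 .
\end{align*}
```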

Alternatively, one can represent the dynamics of a system in state-space form as 
a set of first order differential equations: 


$$\frac{dx}{dt} = f(x,u), \quad \text{and} \quad y = g(x,u) \tag{12.17}$$


The vector x is the state of the system (typically describing the concentrations of 
the species) and u is the input. For a linear system, we can simplify this description 
to the matrix form of equation 12.3. One can introduce a new integral feedback 
state z with the dynamics 


$$\frac{dz}{dt} = Cx + Du = y \tag{12.18}$$


In this manner, integral feedback is implemented, and these dynamics guarantee 
that the steady state error approaches zero no matter what the values of u, A, B, 
C, and D as long as the system is stable. 


12.2.3 Necessity of Integral Control and the Internal Model Principle 


The previous section demonstrated the sufficiency of integral control for robust 
perfect regulation. What about necessity? Is it true that any system exhibiting 
robust perfect regulation must contain integral feedback? A simple necessity proof 
for linear systems is provided that relies on the state-space description given above. 

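At steady state, dx/dt = 0, so that $x = -A^{-1}Bu$ and $y = (D - CA^{-1}B)u$. Thus, 
y = 0 at steady state for all constant u, if and only if either 


$$\begin{bmatrix} C & D \end{bmatrix} = 0, \quad \text{or} \quad \det \begin{bmatrix} A & B \\ C & D \end{bmatrix} = 0 \tag{12.19}$$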










The former is the trivial case when y(t) = 0 for all t, and the latter holds if and 
only if there exists a $k \neq 0$ such that 


$$k \begin{bmatrix} A & B \end{bmatrix} = \begin{bmatrix} C & D \end{bmatrix} \tag{12.20}$$


Thus, defining z = kx, we have $dz/dt = k\dot{x} = k(Ax + Bu) = Cx + Du = y$, which 
is the standard integral control equation. 
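
The algebra behind this argument is easy to check numerically. In the sketch below, the matrices A and B and the row vector k are arbitrary values chosen for illustration (they are assumptions, not taken from any model in the chapter). C and D are then built from equation 12.20, and the steady state gain D - CA^{-1}B and the determinant in equation 12.19 are both confirmed to vanish to rounding error.

```python
import numpy as np

# Arbitrary stable example system (assumed values for illustration only).
A = np.array([[-1.0, 0.5],
              [0.2, -2.0]])
B = np.array([[1.0],
              [0.5]])
k = np.array([[0.3, -0.7]])   # any nonzero row vector

# Equation 12.20: k [A B] = [C D]
C = k @ A
D = k @ B

# Steady-state map from constant u to y: y = (D - C A^{-1} B) u
steady_state_gain = D - C @ np.linalg.solve(A, B)

# Equation 12.19: the block matrix is singular because its last row is k times the rows above
block = np.block([[A, B], [C, D]])

print("steady-state gain:", steady_state_gain[0, 0])
print("determinant:", np.linalg.det(block))
```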

This necessity statement suggests that integral control is prevalent at all levels 
of biology from cellular regulation to organismal physiology to ecosystem balance. 
Just as integral feedback is used ubiquitously in man-made systems, it must also be 
a common control strategy in biological systems given the requirement that internal 
variables maintain constant steady state values despite step disturbance changes. 

The necessity of integral control applies to step changes. The internal model 
principle (IMP) generalizes this notion. The principle states that the robust tracking 
of an arbitrary signal requires a model of that signal to be in the controller. 
The intuition is that the internal model counteracts the external signal so that 
$y(t) \to 0$ as $t \to \infty$ even in the presence of parameter perturbations. For example, 
a controller containing an integrator (C(s) = 1/s) is necessary to track robustly 
a step signal (U(s) = 1/s). Francis and Wonham (1976) proved IMP for linear 
systems. Isidori and Byrnes (1990) have established a general framework for IMP 
in nonlinear systems using techniques from differential geometry. Sontag (2003) 
has provided a succinct statement of IMP relevant to biological systems. However, 
these topics are beyond the scope of this chapter. It is important to appreciate 
that living systems are subject not only to constant, or step, changes, but also 
to perturbations that involve steadily rising or falling signals, and to even more 
complex disturbance behaviors (for example, neural signals). In order to maintain 
homeostasis, the feedback control system implemented by the biological network 
must contain an internal model of the disturbance according to IMP. An area for 
future research is cataloging these control structures and addressing the question 
of how biology builds these internal models. 


12.2.4 Examples of Integral Control in Biology 


Here we illustrate two biological examples of integral control. One is in the area of 
blood calcium homeostasis and the other is in the area of sensory adaptation. 

Blood calcium regulation. The level of calcium in the blood is carefully reg- 
ulated against disturbances in calcium utilization and uptake. The two compounds 
parathyroid hormone (PTH) and vitamin D (VitD) play a central role in this reg- 
ulation. They control how much calcium is introduced into the blood from the 
intestine (vitamin D) and from the bone (PTH). El-Samad and Khammash have 
formulated a model, illustrated schematically in figure 12.5A, of these dynamics in 
mammals (El-Samad et al., 2002). 

Figure 12.5 A) Model of blood calcium regulation (El-Samad et al., 2002). A
disturbance, $d_1$, in calcium dynamics is attenuated by the control action $u_c$,
mediated by PTH and vitamin D, which is produced by the block $C(s) = k_P + k_I/s$
representing a PI controller. The error $y$ is the current level of calcium, [Ca],
minus the set point level of calcium, $[\mathrm{Ca}]_0$. B) Model of regulation of
bacterial chemoreceptors. Receptor activity depends on the ligand aspartate as well
as the degree of receptor methylation (feedback term). An integral controller is
implemented because CheB only demethylates active receptors. This control system
ensures that the steady state level of receptor activity is constant despite changes
in ligand concentration or in receptor numbers.

A disturbance $d_1$ affects the rate at which calcium is taken up or removed from
the blood; this disturbance is compensated for by the action of PTH and vitamin
D, $u_c$: $d[\mathrm{Ca}]/dt = u_c + d_1$. The error is the deviation from the steady
state blood calcium level ($y = [\mathrm{Ca}] - [\mathrm{Ca}]_0$). It is known from
physiological measurements that the level of PTH is proportional to this error
($y \propto [\mathrm{PTH}]$). In addition, the rate of production of vitamin D is
proportional to the concentration of PTH, and assuming that vitamin D has a slow
degradation rate on the time scales of interest, we have
$d[\mathrm{VitD}]/dt = k[\mathrm{PTH}]$. Thus, we can calculate the error in terms
of [PTH] or [VitD]: $y = k_1[\mathrm{PTH}] = k_2\, d[\mathrm{VitD}]/dt$. Finally, if
we approximate the rate of calcium absorption from the intestine or bone as linear
functions of [VitD] and [PTH], respectively, we have the following equation for the
control action: $u_c = k_3[\mathrm{PTH}] + k_4[\mathrm{VitD}] = k_P y + k_I \int y$.
Thus, this system exhibits proportional-integral (PI) control.
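
As a rough numerical illustration of the calcium model above, the following sketch
integrates the calcium balance with a constant (step) disturbance and compares
proportional-only control with PI control. Everything specific here is an assumption
made for illustration: the function name, the gain values, the unit disturbance, the
forward-Euler scheme, and the sign convention (the control action is written as
negative feedback with positive gains, which corresponds to negative constants in the
notation of the text).

```python
def simulate_calcium(kp, ki, d=1.0, t_end=40.0, dt=0.001):
    """Forward-Euler integration of d[Ca]/dt = u_c + d about the set point.

    y is the error [Ca] - [Ca]_0 and z is the running integral of y.
    The control action is u_c = -(kp*y + ki*z); gains and d are hypothetical.
    Returns the error y at time t_end.
    """
    y, z = 0.0, 0.0
    for _ in range(int(t_end / dt)):
        u = -(kp * y + ki * z)   # PI control action (negative feedback)
        dy = u + d               # calcium balance with disturbance d
        dz = y                   # integral of the error
        y += dt * dy
        z += dt * dz
    return y


# Proportional control alone leaves a residual error of about d/kp ...
error_p = simulate_calcium(kp=2.0, ki=0.0)
# ... whereas adding the integral term drives the error back to zero.
error_pi = simulate_calcium(kp=2.0, ki=1.0)
print(f"P only: y = {error_p:.4f}   PI: y = {error_pi:.4f}")
```

Under these assumptions the proportional-only loop settles with a nonzero offset,
while the PI loop returns the calcium level to its set point for any constant
disturbance.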

Bacterial chemotaxis signaling pathway. Bacteria are able to sense gra- 
dients of attractants and repellents. The signal transduction pathway responsible 
for this behavior possesses several special features to ensure both exquisite sensi- 
tivity and wide dynamic range. One such feature is perfect adaptation: the output 
of the pathway (flagellar rotation) exactly returns to its prestimulus value even 
in the presence of continuous stimulation so that the steady state level of output 
activity asymptotically approaches a constant value independent of the attractant 
concentration. The bacterial chemotaxis system is a two-component signaling sys- 
tem (Stock et al., 1991). The receptor complex, which consists of the receptor, the 
histidine kinase CheA, and the adaptor protein CheW, phosphorylates the response 
regulator CheY. Phosphorylated CheY interacts with the flagellar motor to induce 
clockwise (CW) rotation and tumbling behavior. The attractant inhibits the
receptor complex, resulting in counterclockwise (CCW) flagellar rotation and straight
runs. Receptor complex activity is regulated by methylation, which mediates
adaptation. Methylation by CheR increases receptor activity; demethylation by CheB
decreases activity. Although there is no direct evidence, we assume that CheB senses 
the activity state of the receptor by only demethylating active receptor complexes. 
This assumption results in an important negative feedback loop; see figure 12.5B. 

Robust versus non-robust perfect adaptation. An important question is 
whether perfect adaptation in bacterial chemotaxis is robust to changes in internal 
and external conditions. Alon, Leibler, and colleagues experimentally tested the 
robustness of perfect adaptation to dramatic changes in the concentration of key 
components of this pathway (Alon et al., 1999). They demonstrated that as the 
methylase CheR was varied over a 50-fold range, the adaptation precision remained 
close to perfect. They went on to show that perfect adaptation was robust not only 
to changes in levels of CheR, but also to changes in the concentration of CheB, 
receptor, and CheY. 

Is it possible to model perfect adaptation in bacterial chemotaxis? Most models 
in the literature indeed were able to reproduce perfect adaptation, but only through 
fine-tuning of the model parameters. Perfect adaptation is nonrobust in these 
models because altering a parameter disrupts perfect adaptation. This can be 
evaluated by systematically varying the model parameters and testing for perfect 
adaptation using continuation methods (Yi et al., 2000). For example, varying 
the total receptor concentration over a 100-fold range in a particular model of 
bacterial chemotaxis (Spiro et al., 1997), one observes perfect adaptation for only 
one particular receptor concentration, 8 μM. This is an example of non-robust
perfect adaptation. 

Alternatively, one can imagine that perfect adaptation is a structural property of 
the system, insensitive to parameter variation, perhaps resulting from a particular 
feedback control mechanism. For example, perfect adaptation was robust to a 100- 
fold change in receptor concentration in another model by Barkai-Alon-Leibler 
(BAL) (Barkai and Leibler, 1997). Varying the levels of several other components 
in the model did not disrupt perfect adaptation. Because integral control is
necessary for robust perfect adaptation, an integral control mechanism must be
present in the BAL model to explain this robust regulation.

Implementation of integral control in the bacterial chemotaxis system.
How is integral control implemented in the BAL model of the chemotaxis system?
A simplified version of the derivation is shown here. The variable $x$ represents the
methylation state of the receptor. The rate of change of $x$, $dx/dt$, equals the
methylation rate $r$ minus the demethylation rate. Using the assumption that CheB
only demethylates active receptor complexes, so that the demethylation rate is
proportional to the receptor activity $A$, we obtain $dx/dt = r - bA$. At steady
state, $\dot{x} = 0$, so $r = bA$, and hence the steady state activity level is
$A_0 = r/b$. We can rewrite this as the familiar $dx/dt = -b(A - A_0) = -by$, where
$y = A - A_0$ is the error. The key point is that if $r$ and $b$ are independent of
$u$, then this system will exhibit perfect adaptation that is robust to changes in
the system parameters.
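
The argument can be checked numerically with a minimal sketch. The text does not
specify how the activity depends on the methylation state and the ligand, so the
form A = x / (1 + u) used below is purely an assumption chosen for illustration, as
are the function name, the parameter values, and the forward-Euler integration.

```python
def adapted_activity(u, r=1.0, b=2.0, t_end=200.0, dt=0.01):
    """Integrate dx/dt = r - b*A after a step in ligand level from 0 to u.

    A = x / (1 + u) is a hypothetical activity law (attractant lowers activity).
    Returns the receptor activity A at time t_end.
    """
    x = r / b                      # pre-stimulus steady state (u = 0, so A = x)
    for _ in range(int(t_end / dt)):
        A = x / (1.0 + u)          # current activity
        x += dt * (r - b * A)      # methylation minus activity-dependent demethylation
    return x / (1.0 + u)


for u in (0.0, 1.0, 10.0):
    print(f"u = {u:5.1f}   adapted activity = {adapted_activity(u):.6f}")
```

In this sketch the activity drops right after the step in u but always returns to
r/b, whatever the ligand level. Changing r or b moves the adapted level, yet
adaptation itself remains exact, which is the robustness property described above.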
12.3 Feedback in Cellular Communication


In the previous section we saw how integral control can be used to achieve robust
perfect adaptation in the signaling pathway regulating chemotaxis in E. coli. We
now consider other uses of feedback mechanisms in cell signaling pathways.


12.3.1 Signal Detection: Fast Excitation, Slow Inhibition 


One of the roles of the integral feedback mechanism employed in E. coli chemotaxis 
is that it allows the cell to determine the rate at which the chemoattractant 
concentration is varying temporally around the cell. Thus, the objective of the 
integral feedback control is not necessarily to reject the step disturbances, but to 
generate signals that mirror the rate of change in the external signal. 

An alternative role of the mechanism can be envisioned (Koshland Jr. et al., 1982; 
Sontag, 2003). A cell may need to monitor the external environment for sudden 
changes, to which it can then adapt. A mechanism for achieving signal detection is 
then required so that the cell can alter its behavior in response to this change. 

A digital logic mechanism demonstrating how this monitoring can take place is 
shown in figure 12.6. The current state of the environment is sensed continuously 
and compared to the previous state through an EXCLUSIVE-OR gate (X-OR). This 
circuit has a “low” output if the two inputs coincide, but a “high” if they differ. 
Thus, if the present and past states of the environment differ, a transient signal is 
generated that can trigger a response. 
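
A minimal sketch of this comparison, assuming a binary, sampled environment and a
stored copy that is simply the signal delayed by a fixed number of samples (the
function name, the delay length, and the sample signal are illustrative choices, not
taken from the text):

```python
def detect_changes(signal, delay=2):
    """XOR each sample of a binary signal with a copy delayed by `delay` samples.

    Before `delay` samples have elapsed, the stored copy is taken to be the
    initial value of the signal.
    """
    output = []
    for k, present in enumerate(signal):
        past = signal[k - delay] if k >= delay else signal[0]
        output.append(present ^ past)   # high only when present and past differ
    return output


environment = [0, 0, 0, 1, 1, 1, 1, 0, 0, 0]
print(detect_changes(environment))
# → [0, 0, 0, 1, 1, 0, 0, 1, 1, 0]
```

Each of the two changes in the environment produces one pulse whose width equals the
delay; while the environment is constant the output stays low.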







Figure 12.6 Signal detection scheme. A) Changes in the external environment can be 
detected by the scheme outlined here. The environmental signal, x, is compared with 
stored copies of this signal, y, using an X-OR signal. This generates a pulse whenever the 
present state does not match the previous one. B) Sample signal levels. The two changes 
in the environment lead to two response pulses. 


For biological signaling, a similar transient response can be effected by a
mechanism in which the current state of the environment generates a fast excitatory
signal that stimulates a response regulator (Koshland Jr. et al., 1982; Levchenko and
Iglesias, 2002); see figure 12.7. The environment also generates a slower inhibitory
signal on this same response regulator. Whenever the state of the environment is
constant, the positive and negative influences balance and the response returns to
basal levels. Recently, this scheme was used to create a synthetic gene network to
allow cell-to-cell communication in E. coli (Basu et al., 2004).




Figure 12.7 Fast excitation, slow inhibition. Biochemical scheme for implementing the
signal detection scheme of figure 12.6 (inset). As seen in the time courses, a rise in
the excitation signal, E, stimulates an inhibitory signal, I, whose level rises slowly.
The excitation leads to a response, R, which is then attenuated by the inhibitor.
Together this leads to a short pulse in the response.


A mathematical model that effects this general mechanism is given by

$$\frac{dI}{dt} = -k_{-1} I + k_1 E$$

$$\frac{dR}{dt} = -k_{-2} I R + k_2 E$$

A change in the environmental signal, $E$, leads to both a fast increase in the
response, $R$, as well as a slower buildup in inhibition, $I$; see figure 12.7. At
steady state,

$$I = (k_1/k_{-1})\,E \quad \text{and} \quad R = (k_2/k_{-2})\,\frac{E}{I}. \tag{12.21}$$

Together, these two equations imply that, at steady state,

$$R \to R^* = \frac{k_2 k_{-1}}{k_{-2} k_1} \tag{12.22}$$

ensuring that the level of response is independent of that of $E$.
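
The pulse-like response and its return to the same steady state can be reproduced
with a short simulation. This is only a sketch: the rate constants (a slow inhibitor
with k1 = k-1 = 0.1 and a fast response with k2 = k-2 = 1), the step sizes in E, the
function name, and the forward-Euler scheme are all assumptions chosen for
illustration.

```python
def simulate_pulse(e_before=1.0, e_after=2.0, k1=0.1, km1=0.1,
                   k2=1.0, km2=1.0, t_end=100.0, dt=0.01):
    """Step the signal E from e_before to e_after at t = 0 and integrate
    dI/dt = -km1*I + k1*E and dR/dt = -km2*I*R + k2*E (forward Euler).

    Returns (peak response, final response, predicted steady state R*).
    """
    r_star = k2 * km1 / (km2 * k1)     # steady state response, independent of E
    I = k1 * e_before / km1            # start from the pre-stimulus steady state
    R = r_star
    peak = R
    for _ in range(int(t_end / dt)):
        dI = -km1 * I + k1 * e_after   # slow inhibition
        dR = -km2 * I * R + k2 * e_after   # fast excitation, removal scaled by I
        I += dt * dI
        R += dt * dR
        peak = max(peak, R)
    return peak, R, r_star


peak, final, r_star = simulate_pulse()
print(f"peak R = {peak:.3f}, final R = {final:.4f}, R* = {r_star:.4f}")
```

With these hypothetical values the response rises well above its basal level shortly
after the step and then relaxes back to R*, regardless of the size of the step in E.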
Though the system looks like a purely feedforward control mechanism, integral
feedback is still being employed. The differential equation for $R$ can be rewritten
as

$$\begin{aligned}
\frac{dR}{dt} &= -k_{-2} I \left( R - \frac{k_2}{k_{-2}} \frac{E}{I} \right) \\
&= -k_{-2} I \left( R - \frac{k_2 k_{-1}}{k_{-2} k_1}
   + \frac{k_2}{k_{-2}} \left( \frac{k_{-1} I - k_1 E}{k_1 I} \right) \right) \\
&= -k_{-2} I \,( R - R^* + D )
\end{aligned}$$

where

$$D = -\frac{k_2}{k_{-2} k_1} \left( \frac{dI/dt}{I} \right) \tag{12.23}$$

approaches zero as $t \to \infty$ provided that $I > 0$. Hence, the system can be
redrawn as in figure 12.8A where the integral control feedback is evident.

Figure 12.8 Integral control mechanism. A) The fast excitation, slow inhibition
mechanism described in the text can be expressed in block-diagram form. A signal
consisting of the response, $R$, plus a transient signal, $D$, is compared against a
constant reference signal, $R^*$. After scaling by the time-varying gain
$k_{-2} I(t)$, the error is integrated. B) The magnitude Bode plot of the linearized
model of this mechanism exhibits the low frequency rise proportional to frequency
that is characteristic of a closed-loop system that has integral control. Parameters
used: $k_2 = 5$, $k_{-1} = 0.01$, and $k_{-2} I_0 = 1$.

This can also be observed by computing the transfer function of the linearized
model of this system. In particular, the closed-loop transfer function between the
environment and the output is

$$\frac{R(s)}{E(s)} = \frac{k_2 s}{(s + k_{-1})(s + k_{-2} I_0)} \tag{12.24}$$

where $I_0$ is the inhibitor concentration about the operating point. The frequency
response of this transfer function is shown in figure 12.8B. This transfer function
demonstrates that the system has two poles, corresponding to the off-rates for the
inhibitor ($k_{-1}$) and response-regulator ($k_{-2} I_0$) equations. It also has a
zero at $s = 0$, which is a consequence of a closed-loop system that has integral
feedback.
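
The effect of the zero at the origin can be seen by evaluating the transfer function
of equation 12.24 along the imaginary axis. The sketch below uses the parameter
values quoted in the caption of figure 12.8 as read here (k2 = 5, k-1 = 0.01,
k-2 I0 = 1); the function name and the chosen test frequencies are illustrative
assumptions.

```python
def G(s, k2=5.0, km1=0.01, km2_I0=1.0):
    """Linearized transfer function R(s)/E(s) = k2*s / ((s + km1)*(s + km2_I0))."""
    return k2 * s / ((s + km1) * (s + km2_I0))


# Zero steady-state gain: a constant input produces no lasting change in R.
print(abs(G(0j)))

# Well below the slower pole, the magnitude grows in proportion to frequency,
# so a tenfold increase in frequency gives roughly a tenfold increase in gain.
low, high = abs(G(1j * 1e-5)), abs(G(1j * 1e-4))
print(high / low)

# Between the two poles the gain is roughly flat, close to k2 / km2_I0.
print(abs(G(1j * 0.1)))
```

The vanishing gain at zero frequency is the frequency-domain statement of perfect
adaptation to step changes in the input.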
12.3.2 Amplifying the Signal through Positive Feedback


The adaptive property of the fast excitation, slow inhibition mechanism is not 
affected by model parameter values; it is clear from the discussion above that 
the kinetic coefficients can be altered, but that the response, after a change in 
the stimulus, will return to its prestimulus concentration. These concentrations, 
however, will be affected by the parameter values, echoing the results obtained in 
bacterial chemotaxis models (Alon et al., 1999; Yi et al., 2000). However, it is also 
known that the external gradient is not amplified with this single mechanism and 
hence cannot explain all the observed response of chemotaxing eukaryotic cells, such 
as neutrophils and the amoeba D. discoideum. To amplify the effect of the stimulus, 
several mechanisms have been proposed (Iglesias and Levchenko, 2002; Levchenko 
and Iglesias, 2002). 

One means of amplifying the response is to add a positive feedback loop
downstream, as shown in figure 12.9A. Suppose that the response of the sensing
mechanism activates a downstream autocatalytic effector according to:

$$\frac{dX}{dt} = -k_{-3} X + k_3 R + \frac{k_f X^n}{1 + k_s X^n} \tag{12.25}$$

The parameter $k_f$, as well as the Hill coefficient, $n$, denote the strength of the
positive feedback. For now, assume that $n = 1$. In the absence of this feedback
($k_f = 0$) the concentration of $X$ is proportional to that of $R$, with
proportionality constant equal to $k_3/k_{-3}$. Now, assume that $k_f > 0$ and that
$k_s X \ll 1$. Then $X$ is, once again, proportional to $R$, but the proportionality
constant is now $k_3/(k_{-3} - k_f) > k_3/k_{-3}$ (for $k_f < k_{-3}$). This can be
arbitrarily large if $k_f \approx k_{-3}$. In this situation, of course, saturation
conditions exist; see figure 12.9B.
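
The amplification can be illustrated by solving equation 12.25 at steady state for
n = 1. In this sketch the feedback strength kf = 0.9 is a hypothetical value (chosen
so that the small-signal gain is finite and close to 10); the remaining values,
k-3 = k3 = 1 and ks = 0.01, follow the caption of figure 12.9 as read here. The
function name and the bisection solver are illustrative choices.

```python
def steady_state_x(R, km3=1.0, k3=1.0, kf=0.9, ks=0.01):
    """Positive steady state of dX/dt = -km3*X + k3*R + kf*X/(1 + ks*X) (n = 1),
    found by bisection. Parameter values are illustrative assumptions."""
    def f(X):
        return -km3 * X + k3 * R + kf * X / (1.0 + ks * X)

    lo = 0.0
    hi = (k3 * R + kf / ks) / km3 + 1.0   # f(hi) < 0 because feedback saturates at kf/ks
    for _ in range(200):
        mid = 0.5 * (lo + hi)
        if f(mid) > 0.0:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)


R = 0.01                                   # small input, so that ks*X << 1
gain_open = steady_state_x(R, kf=0.0) / R  # no feedback: k3/km3
gain_fb = steady_state_x(R) / R            # with feedback: close to k3/(km3 - kf)
print(f"gain without feedback = {gain_open:.3f}, with feedback = {gain_fb:.3f}")
```

For small inputs the closed-loop gain is close to k3/(k-3 - kf); for larger inputs
the saturating term limits the response, so the gain falls below this value.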


12.3.3 Positive Feedback and Cooperativity: Hysteretic Behavior 


More interesting behavior can arise if the Hill coefficient of the feedback term is 
greater than one. In this case, the response does not vary significantly for small 
changes in the stimulus; see figure 12.9C. However, once a threshold value is reached, 
the response changes abruptly to a higher level. At this point, the response is
once again relatively insensitive to the stimulus. To return to prestimulus levels,
a significant decrease in the stimulus is needed. This hysteretic response arising 
from a bistable system is common in engineering circuits. For example, the Schmitt 
trigger implements a bistable circuit by closing a positive feedback loop around 
an operational amplifier (Sedra and Smith, 2004). There is experimental evidence 
that cells also rely on bistability for regulation (Xiong and Ferrell Jr., 2003). 
Synthetic biological switches have also been designed and built based on these 
principles (Hasty et al., 2000; Ozbudak et al., 2004). 
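
Bistability of this kind can be demonstrated by counting the steady states of
equation 12.25 with n = 2. The parameter values below (k-3 = k3 = 1, kf = 0.4,
ks = 0.05) are hypothetical; they were chosen so that the bistable window lies at
roughly 0.45 < R < 0.69 and are not the values behind figure 12.9C. The function
name, the grid scan, and the finite-difference stability test are likewise
illustrative.

```python
def steady_states(R, km3=1.0, k3=1.0, kf=0.4, ks=0.05, n=2, grid=4000):
    """Return [(X, is_stable), ...] for the steady states of
    dX/dt = -km3*X + k3*R + kf*X**n / (1 + ks*X**n), located by scanning for
    sign changes on a grid and refining each bracket by bisection."""
    def f(X):
        return -km3 * X + k3 * R + kf * X**n / (1.0 + ks * X**n)

    x_max = (k3 * R + kf / ks) / km3 + 1.0   # no steady state can lie above this
    step = x_max / grid
    found = []
    for i in range(grid):
        a, b = i * step, (i + 1) * step
        if f(a) * f(b) < 0.0:
            for _ in range(100):
                m = 0.5 * (a + b)
                if f(a) * f(m) <= 0.0:
                    b = m
                else:
                    a = m
            x = 0.5 * (a + b)
            h = 1e-6
            stable = f(x + h) - f(x - h) < 0.0   # dX/dt decreasing through zero
            found.append((x, stable))
    return found


for R in (0.2, 0.55, 1.0):
    print(R, [(round(x, 3), s) for x, s in steady_states(R)])
```

For low and high inputs a single stable steady state is found, whereas intermediate
inputs give two stable states separated by an unstable one, which is the origin of
the hysteresis described above.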

Several models based on bistable signaling systems have been proposed to account
for the large chemoattractant-induced responses in cells (Meinhardt, 1999; Narang
et al., 2001; Postma and Van Haastert, 2001). However, because of the hysteretic
nature of their response, these models cannot account for the behavior seen in
unpolarized D. discoideum cells. These unpolarized amoebae are equally sensitive
around the whole membrane, and so when they are subjected to sudden changes in
the concentration gradient, these cells can rapidly respond (Iglesias and Levchenko,
2002; Devreotes and Janetopoulos, 2003). A hysteretic switch, however, could
account for the response of polarized cells, which “remember” their polarization.
These cells, when subjected to a change in the chemoattractant gradient, tend to
turn.

Figure 12.9 Amplification through positive feedback. A) To amplify the response to an
external environmental signal, the sensing mechanism of figure 12.7 can be followed by
an amplification mechanism that relies on positive feedback. B) Steady state
concentration of $X$ for the system described by equation 12.25 as a function of the
input concentration $R$. For comparison, the response of the system assuming linear
feedback, $n = 1$ (solid), is plotted alongside that of the open-loop system,
$k_f = 0$ (dotted line). Coefficients used are $k_{-3} = k_3 = k_f = 1$ and
$k_s = 0.01$. C) Hysteretic behavior observed when $n = 2$ (solid). When the input
concentration is low, $R < 5$, or high, $R > 8$, the system exhibits only one steady
state response. For intermediate values, two stable (solid lines), as well as one
unstable (dashed line), steady states are present. This bistable system can lead to
discontinuous behavior. When the input is increased beyond the transition level, a
sudden rise in the output can be observed, as the system moves from the low level of
response to the higher level, shown by the arrow. To return to the lower level, the
input signal must be reduced significantly. The response in the absence of feedback
is shown for comparison (dotted).

More recently, a means for amplifying the response to an external signal has been
suggested in which parallel sensing mechanisms, acting independently, cooperate
to produce an amplified response to the chemoattractant stimulus (Ma et al.,
2004b). The advantage of such a mechanism for amplifying the signal is that it
employs a redundant mechanism. If, for some reason, one of the two pathways is 
impeded, the cell can still detect stimuli. Such mechanisms have been demonstrated 
experimentally in D. discoideum cells in which one of the two pathways is disrupted, 
either through knockout mutations or pharmacological inhibitors. These cells are 
still able to sense external stimuli, though chemotaxis is partially impaired (Iijima 
and Devreotes, 2002; Funamoto et al., 2002). 


12.3.4 Oscillations: Positive and Negative Feedback Work Together 


Besides providing for strong amplification, one of the uses of positive feedback is as 
a means of obtaining oscillatory behavior. This was first used in engineering around 
1915 (Bennett, 1979). 

In biology, oscillatory systems are ubiquitous, from the circadian rhythms to 
genetic oscillators to the wave pattern observed in D. discoideum cell-to-cell
communication (Goldbeter, 2002; Kruse and Jülicher, 2005). Many of these systems
rely on the interplay of positive and negative feedback loops. 


Figure 12.10 Positive feedback through autocrine loop. In D. discoideum cells, the 
pathway that senses extracellular cAMP also stimulates the production of intracellular 
cAMP, which is then secreted from the cell. In doing so, a positive feedback loop is closed. 


Autocrine loops arise when a cell secretes a chemical that stimulates the secretory 
cell itself. For example, receptor binding of extracellular cAMP in the amoeba D. 
discoideum induces the activation of adenylyl cyclase of aggregation (ACA). This 
leads to the synthesis of intracellular cAMP from ATP. This cAMP is then secreted 
into the extracellular medium where it can diffuse away and thereby signal nearby 
cells. However, the secreted cAMP may also find its way back to the cell. In doing
so it closes a positive feedback loop involving the chemoattractant sensory system;
see figure 12.10.

A thorough analysis of autocrine loops requires that the stochastic nature of the
diffusion of the signaling molecule be considered (Batsilas et al., 2003). However, if 
we ignore the spatial considerations of the diffusion, the analysis of the autocrine 
loop is relatively straightforward. The positive feedback path is coupled to the 
negative feedback mechanism, described above, that provides adaptation. Together, 
these intertwined positive and negative feedback loops lead to the formation of 
cAMP waves that can propagate as circular or spiral wave forms (Kessin, 2001). 

Several models have been proposed to describe the oscillatory behavior found 
in cAMP signaling in D. discoideum (Halloy et al., 1998; Laub and Loomis, 1998; 
Nagano, 2000; Iglesias, 2003). While these models differ in the biochemical identities 
of activators and inactivators, they all rely on an interplay between positive and 
negative feedback to achieve this periodic oscillation. 

Here we take a systems-level approach to the analysis of the autocrine loop and
thereby demonstrate the use of several control-theoretic analysis techniques. Using
the scheme described in figure 12.10, we assume that the production of intracellular
cAMP is governed by

$$\frac{dY}{dt} = -k_{-4} Y + k_4 X \tag{12.26}$$

and that this changes the extracellular concentration as:

$$\frac{dE}{dt} = -k_{-5} E + k_5 Y \tag{12.27}$$

The parameter $k_5$ can be used to describe the strength of the feedback loop.

The system can now be treated as in figure 12.11A. Here, a linear system is 
found in feedback with a discontinuous nonlinear element, which can serve as an 
approximate model for the hysteretic bistable system described in section 12.3.3. 
This type of feedback system has been studied extensively in the control literature, 
where it is sometimes known as a relay or relaxation oscillator (Tsypkin, 1984; 
Varigonda and Georgiou, 2001). In these cases, the describing function method can 
be used (Khalil, 2002) to determine whether oscillatory behavior is possible. 

The analysis is predicated on computing an equivalent gain through the nonlinear
system. Suppose that the input to the nonlinearity is the sinusoid
$x(t) = A\sin(\omega_0 t)$ and assume that the output $y(t)$ is also periodic. It can
then be described by its Fourier series. For example, if the nonlinearity involved is
the hysteresis function described in figure 12.11A, then the sinusoidal input leads to
a square wave output with Fourier series

$$y(t) = \frac{4}{\pi} \sum_{k=0}^{\infty}
\frac{\sin\left([2k+1][\omega_0 (t - t_0)]\right)}{2k+1},
\qquad \sin(\omega_0 t_0) = \frac{\delta}{A} \tag{12.28}$$

If we focus on the fundamental frequency, $k = 0$, the nonlinearity has a gain
equivalent to:

$$n(A) = \frac{4}{\pi A}\, e^{-i \omega_0 t_0} \tag{12.29}$$

Figure 12.11 Describing function technique. A) The linear, time-invariant system with
transfer function $G(i\omega)$ is placed in feedback with a nonlinear system. If the
system exhibits oscillatory behavior, then a sinusoidal input to the nonlinear system,
$x$, gives rise to the square wave output $y(t)$. The frequency of this output matches
that of the input, though the hysteretic nature of the nonlinearity gives rise to a
phase shift. In the describing function analysis technique, the square wave is
expressed as a Fourier series, and an effective gain, $n(A)$, is computed. B) To
determine whether oscillatory behavior is predicted, a polar plot of the linear
system's transfer function is obtained. Here, the real and imaginary parts of
$G(i\omega)$ are plotted as a function of $\omega$ (solid). On the same axes, the
function $1/n(A)$ is plotted as a function of $A$ (dotted). Points of intersection
correspond to frequencies $\omega_0$ and amplitudes $A$ where
$1 - G(i\omega_0)n(A) = 0$. These are predicted magnitudes and frequencies of
oscillation. C) Limit cycle oscillation arising from the system described in panel B.


If there is oscillatory behavior, then the loop gain at the frequency $\omega_0$ must
be one; that is, $1 = G(i\omega_0)n(A)$. Thus, by plotting $G(i\omega)$ as a function
of $\omega$ and $1/n(A)$ as a function of $A$, we can look for points where this
equation is satisfied; see figure 12.11B. These points predict the existence of an
oscillation with the given frequency and magnitude.
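
The equivalent gain of equation 12.29 can be checked numerically. The sketch below
assumes a relay with unit output amplitude that switches to +1 when its input rises
above a threshold delta and to -1 when the input falls below -delta; this reading of
the hysteresis element, the function names, and the test values of A and delta are
assumptions made for illustration. It drives the relay with A sin(theta), extracts
the fundamental Fourier component of the output, and compares the resulting complex
gain with the closed-form expression.

```python
import cmath
import math


def relay_fundamental_gain(A, delta, n=20000):
    """Numerically estimate the complex gain of a unit relay with hysteresis
    for the input A*sin(theta), using the fundamental Fourier component."""
    state = -1.0
    acc = 0j
    dtheta = 2.0 * math.pi / n
    # Two periods: the first lets the relay state settle, the second is measured.
    for k in range(2 * n):
        theta = k * dtheta
        x = A * math.sin(theta)
        if x > delta:
            state = 1.0
        elif x < -delta:
            state = -1.0
        if k >= n:
            acc += state * cmath.exp(-1j * theta) * dtheta
    # If y(theta) ~ Im(Y*exp(i*theta)), then Y = (i/pi) * integral of y*exp(-i*theta).
    Y = 1j * acc / math.pi
    return Y / A


def relay_describing_function(A, delta):
    """Closed form (4/(pi*A)) * exp(-i*asin(delta/A)), valid for A > delta."""
    return 4.0 / (math.pi * A) * cmath.exp(-1j * math.asin(delta / A))


A, delta = 2.0, 1.0
print(relay_fundamental_gain(A, delta))
print(relay_describing_function(A, delta))
```

The two values agree to within the discretization error: the magnitude of the gain
falls as 1/A, and the hysteresis contributes a phase lag of arcsin(delta/A), which
corresponds to the delay t0 in equation 12.28.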

The describing function technique is one means of studying oscillatory behavior 
analytically. Other techniques and methods include the use of bifurcation analysis 
(Ma and Iglesias, 2002). 







12.4 Discussion 


Control theory originated to meet the needs of a variety of engineering disciplines.
In its essence, it facilitates the analysis and design of dynamical systems that are 
used to regulate the performance of a larger system. Almost always, these systems 
involve considerable feedback loops which can endow them with excellent robustness 
properties, but can also make them vulnerable; see chapter 2. 

In this chapter we have attempted to introduce readers to several of these 
tools. Because many of the techniques used in the analysis and design of control 
systems are based on linear analysis, we have emphasized this. While it is true 
that “real systems” are not linear, it is also true that considerable insight can be 
obtained regarding the dynamical behavior of nonlinear systems near equilibria by 
considering their linearizations (Khalil, 2002). This approach is especially fruitful 
when applied to systems whose architecture ensures they will spend most of their 
time near a steady state, including systems governed by homeostasis. We have 
shown how, by considering linear systems, powerful frequency-domain and transfer- 
function tools are available for analysis. 

We have also tried to illustrate how understanding of engineering control sys- 
tems can lead to some intuition as to the system behavior of biological systems. 
For example, knowledge of the internal model principle helps evaluate models of 
perfect adaptation in biology. Similarly, understanding how hysteretic switches and 
amplifiers arise out of positive feedback may lead to a better understanding of how 
these behaviors arise in biology. 

Finally, we note that, historically, control theory first arose out of a need to 
understand the behavior of systems (Bennett, 1979). This theory was then used to 
design and engineer better systems. It is not difficult to foresee that, in biology, 
control theory may follow the same path. At first, we expect that both existing and 
new tools will be used to analyze existing biological systems. However, we expect 
that these tools will later allow us to design and implement synthetic biological 
systems. In fact, we now see the first steps in this process (Hasty et al., 2002). 
Notes


1. For a rational transfer function $G(s) = N(s)/D(s)$, where $N(s)$ and $D(s)$ are
polynomials with no common roots, the poles (respectively zeros) of the transfer
function are the roots of the denominator (respectively numerator) polynomial.
For example, the transfer function $G(s) = (s + 1)/(s^2 + 1)$ has one zero at $-1$
and two poles at $\pm i$. Much of classical control theory deals with the use of
feedback to manipulate the location of the poles of the closed-loop transfer
function. See (Doyle et al., 1992) for examples.








13 Synthetic Gene Regulatory Systems 


Mads Kærn and Ron Weiss


In parallel with the development of high-throughput technologies fueling systems 
biology, advances in modeling of biological systems and in synthesis of long DNA 
fragments with arbitrary nucleotide sequences have fostered the emergence of a 
nascent field termed synthetic biology. At its core, this field uses recombinant DNA 
manipulation techniques to design and embed complex “programmed” functions into 
living organisms. An important notion that pervades most of the work in synthetic 
biology is the use of mathematical models for forward design. As such, systems and 
synthetic biology can be viewed as being two sides of the same coin. While systems 
biology attempts to unravel how the set of instructions encoded by an organism’s 
DNA orchestrates its phenotypical complexity, synthetic biology aims to create 
cells with desirable behaviors through the integration of additional instructions. 
This can be achieved by first investigating which network architectures support the 
desired outcome and then augmenting the genotype accordingly. The construction 
of synthetic gene regulatory systems can thus help understand natural systems 
by complementing approaches in which quantitative analysis is used to elucidate 
“design principles” underlying the functioning of natural intracellular networks. 
Moreover, synthetic systems provide excellent examples of the direct link between 
theoretical modeling and biological reality. 





13.1 Introduction 


During the last few decades, the ability to isolate, sequence, and manipulate DNA 
has led to tremendous advances in genetic engineering with numerous benefits to 
science, agriculture, and medicine. Typically, genetic engineering is used to endow 
a genetically modified organism with a novel trait, such as resistance to certain 
pesticides or the ability to efficiently synthesize pharmacological molecules, for 
example by transferring a gene from another organism. Gene therapy is another 
example. There, a trait lost due to a nonfunctional endogenous gene is typically
recovered by inserting a normal copy of the gene at a non-specific location within
the genome. Synthetic biology can be viewed as a natural extension of such single- 
gene approaches in the sense that entire systems are inserted into the genome of 
the host cell. 

So far, efforts in synthetic biology have included the construction of novel gene 
regulatory networks, signal transduction pathways, metabolic pathways, synthetic 
multicellular systems, engineered sensory proteins, and the regulation of proteins 
that control intrinsic cell functions. An excellent introductory review of the many 
different aspects of synthetic biology is given by Benner and Sismour (2005). Here, 
we focus on the mathematical design and experimental implementation of selected 
synthetic gene regulatory networks that embody important architectural properties. 
Prerequisites for designing and implementing synthetic gene regulatory networks 
include understanding how transcriptional regulation works, how transcription 
factor proteins regulate the expression of each other within networks, and knowledge 
of recombinant DNA technologies. An excellent introduction to the latter is given 
by Nicholl (1994). 

General aspects of transcriptional regulation and how transcription is modeled are 
discussed in sections 13.2 and 13.3, respectively. The remaining sections highlight 
how synthetic gene regulatory systems have been designed and implemented in the 
bacterium Escherichia coli based on network models constructed from phenomeno- 
logical mathematical descriptions of transcriptional regulation. In sections 13.4 and 
13.5, we discuss linear transcriptional networks and feedforward networks, respec- 
tively. In section 13.6, we provide examples of networks that support bistability 
and oscillations by incorporating feedback control. These systems demonstrate how 
some of the principles investigated in chapter 6 have been used to create living cells 
with complex dynamical properties. 





13.2 Transcriptional Regulatory Modules


In order to engineer gene regulatory systems, it is necessary to appreciate some 
of the basic elements of gene regulation. Natural genetic circuits are typically 
described as circuits of interconnected modules consisting of interacting proteins, 
DNA, RNA, and small molecules that regulate the transcription of genes into 
mRNA, the translation of mRNA into polypeptides, and the biological activity 
of the expressed proteins. While the abundance of an expressed protein can be 
controlled by many different mechanisms, the regulation of gene transcription is 
one of the most common. In prokaryotes, this type of control is often mediated 
through transcription factor proteins that alter the ability of the RNA polymerase 
to bind to and initiate transcription from promoter regions located upstream of the 
regulated genes. 

Prokaryotic transcriptional regulatory modules often consist of four elements: a
promoter region, the gene (or genes) expressed from that promoter, the transcription
factor proteins that regulate the expression level, and additional regulatory
molecules that modulate the activity of the transcription factors. A schematic
layout of such modules is shown in figure 13.1A. The expression from the regulated
promoter can be measured in single cells by expressing a reporter gene, such as
the gfp, yfp, or cfp genes encoding green (GFP), yellow (YFP) and cyan (CFP)
fluorescent protein, respectively (see chapter 10).

Figure 13.1 A. Architecture of a prototypical transcriptional regulatory module. B-C.
Population-averaged signal-response curves measured in E. coli cells expressing a
reporter gene from LacI/P_LlacO and TetR/P_LtetO modules with fixed concentrations of
LacI or TetR and varying concentrations of the inducers
isopropyl-β-D-thiogalactoside (IPTG) and anhydrotetracycline (aTc), respectively.
Based on experimental data from (Lutz and Bujard, 1997).

Transcriptional regulatory proteins increase (activators) or decrease (repressors) 
the probability that a gene is transcribed into mRNA by binding to stretches of 
DNA within or near promoter regions referred to as operators or cis-regulatory 
elements (figure 13.1A). While activators may facilitate the binding of RNA poly- 
merase to the promoter, repressors often exert their function by competing with the 
RNA polymerase for promoter access. Transcription from a promoter containing ap- 
propriate cis-regulatory elements can thus be controlled by up- or down-regulating 
the cellular abundance of the corresponding transcription factor proteins. In some 
cases, external control over such in vivo signals is provided by small molecules 
called inducers. These molecules typically function by modulating the activity of a 
transcription factor protein. Specifically, when the inducer binds to the protein, it 
causes an alteration in its three-dimensional structure that increases or decreases 
the affinity between the protein and its cognate cis-regulatory elements. Varying 
the inducer concentration thus provides a means of regulating transcription without 
altering cellular protein abundances directly. 

Figures 13.1B and 13.1C illustrate how expression of a reporter protein from two
engineered transcriptional regulatory modules, LacI/PLlacO and TetR/PLtetO, is
modulated by the inducers isopropyl-β-D-thiogalactoside (IPTG) and
anhydrotetracycline (aTc), respectively. The PLlacO and PLtetO promoters are obtained by
inserting lacO and tetO operator sequences, corresponding to the binding sites of
LacI and TetR, respectively, into the PL promoter normally repressed by the
protein CI. In both cases, the signal-response curve, in other words, the relationship
between the regulatory input signal (the inducer concentration) and the output
signal (the abundance of the reporter protein), is highly nonlinear and sigmoidal.
The endogenous E. coli promoters Plac and Ptet, which are repressed by LacI and
Table 13.1 Transcription regulatory modules used frequently to construct synthetic
gene regulatory networks in E. coli. TetR and LacI are repressors that are inactivated
by their inducers. LuxR is an AHL-dependent activator. CI is generally a repressor, but
activates transcription from the PRM promoter.

Regulatory protein   Regulated promoters     Inducer
TetR                 Ptet, PLtetO            tetracycline
LacI                 Plac, PLlacO, Ptrc      lactose, IPTG
CI                   PL, PR, PRM, PluxOR     -
LuxR                 Plux, PluxOR            acyl-homoserine lactones (AHL)

TetR, respectively, respond to induction in a similar fashion. Additional transcrip- 
tional regulatory modules used frequently in synthetic biology are summarized in 
table 13.1. 





13.3 Modeling Transcriptional Modules 


In the remaining sections of this chapter, we discuss how simulation and analysis 
of mathematical models have been employed to forward engineer E. coli cells with 
novel characteristics and sophisticated computational capabilities by interconnect- 
ing the modules in table 13.1 into larger networks. We use a convenient abstraction 
to model these biochemical networks with ordinary differential equations (ODEs) 
that include basal expression of a protein, protein decay, and Hill function descrip- 
tions of gene regulation (see chapter 6). 

In general form, the ODE that models the output Z of a genetic module given 
the regulatory input S is given by: 


\[
\frac{d[Z]}{dt} = k' + k\,\frac{(S/K)^{un}}{1 + (S/K)^{n}} - d\,[Z] \qquad (13.1)
\]


where the parameter u is used to distinguish between the cases of repression (u = 0)
and activation (u = 1) of transcription by S (see, for example, Kuznetsov et al.
(2005) for details). The constants K and n are the Hill constant and Hill coefficient,
respectively. The Hill constant gives the value of the input signal that yields 50%
response, and the Hill coefficient determines the steepness of the signal-response
curve at this input signal. The parameter d is the rate constant associated with the
decay of the output reporter protein. Additionally, the parameters k' and k are the rate
constants associated with signal-independent (basal) and signal-dependent gene
expression, respectively. The values of k and k' are typically correlated, and this
interdependence is often modeled by setting k' = a·k with 0 < a < 1. With this
relationship, the steady state solution of equation 13.1 is given by:

\[
[Z]_{ss} = \frac{k}{d}\left( a + \frac{(S/K)^{un}}{1 + (S/K)^{n}} \right) \qquad (13.2)
\]

Hence, in steady state, the cellular abundance of the reporter protein reflects the
relationship between the regulatory signal and the transcription rate modeled by
the Hill function.
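
The steady state relationship in equation 13.2 can be evaluated directly. The following is a minimal sketch in Python; the function name and the parameter values are illustrative assumptions, not values taken from any experiment discussed in this chapter.

```python
def hill_steady_state(S, k, d, a, K, n, u):
    """Steady-state output [Z]ss of equation 13.2.

    u = 0 models repression by the signal S, u = 1 models activation by S.
    The basal rate is k' = a*k, as assumed in the text.
    """
    x = S / K
    return (k / d) * (a + x ** (u * n) / (1.0 + x ** n))


if __name__ == "__main__":
    # Assumed, illustrative parameter values (k in nM/min, d in 1/min, K in nM).
    params = dict(k=20.0, d=0.1, a=0.02, K=10.0, n=2.0)
    for S in (0.0, 10.0, 1000.0):
        repressed = hill_steady_state(S, u=0, **params)
        activated = hill_steady_state(S, u=1, **params)
        print(f"S = {S:7.1f} nM  repression: {repressed:7.2f}  activation: {activated:7.2f}")
```

At S = K the Hill term equals 1/2 for both u = 0 and u = 1, which is the sense in which K marks the 50% response point.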

It is noted that equation 13.1 models transcription and translation as a single 
step. Because the separation of transcription and translation introduces response 
delays, it can be important in models of temporal dynamics to include mRNA as an 
independent variable. In this case, a single-input transcriptional regulatory module 
is described by the ODEs: 





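\[
\begin{aligned}
\frac{d[M]}{dt} &= k_{tr}\left( a + \frac{(S/K)^{un}}{1 + (S/K)^{n}} \right) - d_m\,[M] \\
\frac{d[Z]}{dt} &= k_{tl}\,[M] - d\,[Z]
\end{aligned} \qquad (13.3)
\]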


where [M] is the mRNA concentration, k_tr is the rate constant associated with
transcription, k_tl is the rate constant associated with translation, and d_m is the
mRNA decay constant. Equation 13.1 is obtained from equation 13.3 by invoking
a steady state assumption for [M] and defining the constant k by k = k_tr·k_tl/d_m.
Hence, modeling transcription and translation as a single step does not change the
steady state solution in equation 13.2.
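
To illustrate that the two descriptions share a steady state but differ in their response time, the sketch below integrates both models with a simple forward Euler scheme. All parameter values, the function names, and the choice of integrator are assumptions made for illustration only.

```python
def hill_term(S, K, n, u):
    """Hill function of equation 13.1: (S/K)^(u*n) / (1 + (S/K)^n)."""
    return (S / K) ** (u * n) / (1.0 + (S / K) ** n)


def compare_models(S, k_tr=2.0, k_tl=5.0, d_m=0.5, d=0.05, a=0.02,
                   K=10.0, n=2.0, u=1, t_end=600.0, dt=0.01):
    """Integrate the lumped model and the model with explicit mRNA.

    Returns the common steady state, both final protein levels, and the
    times at which each model first reaches half of the steady state.
    """
    k = k_tr * k_tl / d_m               # lumped rate constant k = k_tr*k_tl/d_m
    drive = a + hill_term(S, K, n, u)   # basal plus signal-dependent term
    z_ss = k * drive / d                # steady state of either model
    M = Z_mrna = Z_lumped = 0.0
    t_half_lumped = t_half_mrna = None
    for step in range(int(round(t_end / dt))):
        t = (step + 1) * dt
        dM = k_tr * drive - d_m * M
        dZ_mrna = k_tl * M - d * Z_mrna
        dZ_lumped = k * drive - d * Z_lumped
        M += dt * dM
        Z_mrna += dt * dZ_mrna
        Z_lumped += dt * dZ_lumped
        if t_half_lumped is None and Z_lumped >= 0.5 * z_ss:
            t_half_lumped = t
        if t_half_mrna is None and Z_mrna >= 0.5 * z_ss:
            t_half_mrna = t
    return {"z_ss": z_ss, "lumped": Z_lumped, "with_mrna": Z_mrna,
            "t_half_lumped": t_half_lumped, "t_half_mrna": t_half_mrna}


if __name__ == "__main__":
    result = compare_models(S=100.0)
    for name, value in result.items():
        print(f"{name:>14s}: {value:.3f}")
```

With these assumed values the model that includes mRNA reaches the half-maximal level later than the lumped model, by an amount on the order of the mRNA lifetime 1/d_m.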

Equation 13.1, or equation 13.3 when mRNA is included, is used to model both
the effect of changing the intracellular concentration of a regulatory protein and
the effect of changing the extracellular concentration of an inducer. For example,
the effect of varying the concentration of a repressor R is modeled by setting the
input signal equal to the repressor concentration, S = [R], with u = 0. When the
concentration of the repressor is constant, the effect of varying the concentration
of its inducer I is modeled by setting S = [I] and u = 1. For example, the steady
state signal-response curves in figures 13.1B and 13.1C for induction of the
LacI/PLlacO and the TetR/PLtetO modules can be modeled using equation 13.2 with
u = 1 and the concentrations of IPTG and aTc defining the signal S, respectively.
Other input-output functions are also possible depending on the regulatory role of
the protein and how the inducer affects the activity of this protein. In cases where
both repressor and inducer concentrations vary, the signal S is the concentration of
active repressor molecules. This signal is modeled by setting
S = [R_T]·K_I^n_I/(K_I^n_I + [I]^n_I), with [R_T] being the total repressor
concentration, and K_I and n_I the Hill constant and coefficient associated with
the repressor-inducer interaction, respectively.

For the purpose of network modeling, we will use the following notation: Each
transcriptional regulator protein is given an index i = 1, 2, ..., N. The concentration
of the transcription factor protein is given by [R_i], and that of its inducer, if
present, by [I_i]. The gene and mRNA that encode the regulatory protein R_i are
denoted r_i and M_i, respectively. The rate constant associated with the decay of
protein R_i is given by d_i. Promoters are identified as follows: P denotes a
constitutively active promoter, P_i a promoter regulated by the protein R_i, and
P_ij a promoter regulated by the proteins R_i and R_j. The parameters characterizing
the transcription from each promoter are identified by the same index as the
promoter for the parameters k and a (or k'), and by the protein index for the Hill
constant K and the Hill coefficient n.





13.4 Linear Networks 


Linear transcriptional regulatory networks consist of modules placed in series with
the output of one module acting as the input to the next module. In the simplest
case, a linear network is composed of two modules and one regulatory step. The
LacI/PLlacO and TetR/PLtetO modules discussed in section 13.2 are examples of
such one-step transcriptional cascades because the transcriptional regulator (R1 in
figure 13.1) is expressed at high constant levels from a constitutively active
promoter. For clarity, the constitutive promoter (that is, the first transcription module)
is omitted from the diagram in figure 13.1. The construction and analysis of longer
transcriptional cascades, which will be discussed next, is useful for determining how
information flows through transcriptional networks and can help better understand
the rules of module composition. For example, cascades composed of two and three
regulatory steps have been engineered with the purpose of investigating time delays,
ultrasensitivity in signal-response relationships, and stochasticity in transcriptional
regulation (see, for example, Blake et al. (2003); Hooshangi et al. (2005); Rosenfeld
et al. (2005); Pedraza and van Oudenaarden (2005)).


13.4.1 Two-Step Cascades 


Figure 13.2 depicts the schematics of a two-step linear repressor cascade obtained
by adding a third transcriptional module to the one-step cascade. The first module
comprises the promoter P with no regulatory inputs; the second module the
repressor R1, its inducer I1, and the promoter P1. The third module comprises
the R2 repressor and the P2 promoter. This configuration provides a mechanism
to measure the behavior of the R2/P2 module. The constitutive promoter P drives
expression of the R1 repressor, which, in turn, inhibits the expression of R2. The
inducer I1 can thus be used to determine the input signal to the R2/P2 regulatory
module by modulating the cellular abundance of repressor R2.

Using equation 13.1 with u = 1 and S = [I1] to model the R1/I1-dependent
expression of R2 and with u = 0 and S = [R2] to model the inhibition by R2 of
expression from P2, the ODEs describing the two-step linear network are given by:








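\[
\begin{aligned}
\frac{d[R_2]}{dt} &= a_1 k_1 + k_1\,\frac{([I_1]/K_1)^{n_1}}{1 + ([I_1]/K_1)^{n_1}} - d_2\,[R_2] \\
\frac{d[Z]}{dt} &= a_2 k_2 + k_2\,\frac{1}{1 + ([R_2]/K_2)^{n_2}} - d\,[Z]
\end{aligned} \qquad (13.4)
\]
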
where the parameters are as defined in section 13.3. Notice that it is
not necessary to include an equation for R1 because its steady state level is constant.

As discussed in section 13.3, the combined regulatory activity of R1 and I1 can be















Figure 13.2 A. Architecture of a two-step repressor cascade. B. Population-averaged
rates of reporter protein synthesis from the PR promoter (black points) and the PR*
promoter (grey points), plotted against repressor concentration (nM) and measured at
the single-cell level using a two-step repressor cascade.
The broken curves give the standard deviation associated with the measured synthesis
rate, and full curves the fit to a Hill function with u = 0. The fitted parameter values are:
n = 2.4 ± 0.3, K = 55 ± 10 nM, k = 220 ± 15 min^-1 for the PR promoter, and n = 1.7 ± 0.3,
K = 120 ± 25 nM, k = 255 ± 40 min^-1 for the PR* promoter (Rosenfeld et al., 2005).

















captured phenomenologically in one Hill function to model the relationship between 
the inducer concentration and the expression from the regulated promoter. 
The steady state solution of equation 13.4 is given by: 


\[
\begin{aligned}
{}[R_2]_{ss} &= \frac{k_1}{d_2}\left( a_1 + \frac{([I_1]/K_1)^{n_1}}{1 + ([I_1]/K_1)^{n_1}} \right) \\
{}[Z]_{ss} &= \frac{k_2}{d}\left( a_2 + \frac{1}{1 + ([R_2]_{ss}/K_2)^{n_2}} \right)
\end{aligned} \qquad (13.5)
\]

In terms of the overall response of this network, the steady state solution predicts
that the presence of inducer (high input) results in repression of P2 (high R2, low
output) and the absence of inducer (low input) allows transcription from the P2
promoter (low R2, high output).











13.4.2 Characterizing Module Input-Output Functions 


The network illustrated in figure 13.2 and described by the model in equation 13.4 
can be implemented using different repressor/promoter pairs. Elowitz and colleagues
(Rosenfeld et al., 2005) implemented a version using the aTc-inducible
TetR/Ptet module to characterize a CI/PR repressor module driving CFP (R1 =
TetR, I1 = aTc, R2 = CI, and Z = CFP). In this implementation, the cI gene is
fused with the yfp gene to synthesize a yellow-fluorescent variant of the CI protein. 
This dual-color labeling allows for simultaneous measurements of the input and out- 
put signals in single cells. Additionally, using time-lapse microscopy to determine 
the rate of change in fluorescence, the dependency of the rate of protein synthesis 
on the repressor concentration can be determined at the level of single cells. The 
system thus enables a direct investigation of the suitability of the Hill function in 
equation 13.1 as a model of the most fundamental signal-response relationship in 
gene regulatory systems. 
Figure 13.3 A. Predicted effects on the signal-response curve of the two-step repressor
cascade of decreasing the rate of repressor R2 synthesis (from k1 = 1000 nM/min to
k1 = 100 nM/min) and increasing the Hill constant (from K2 = 10 nM to K2 = 100 nM).
Other parameter values are a1 = a2 = 0.02, d1 = d2 = 0.1 min^-1, k2 = 20 nM/min,
and n = 2. B. Experimentally measured signal-response in two-step repressor cascades
containing altered ribosome binding sites (RBS) of repressor-encoding mRNA to change
k1, or mutations in the regulated promoter (pM) to change K2. Based on experimental
data from (Weiss and Basu, 2002).


Figure 13.2B illustrates the experimentally observed relationship between the
concentration of the CI-YFP protein and the population-averaged rate of CFP
synthesis for the PR promoter and a variant of this promoter, designated PR*, where
one of the CI binding sites is mutated. Also included are the standard deviations
associated with the average protein synthesis rates and the signal-response curves
obtained by fitting the data to Hill functions.


13.4.3 Matching Kinetic Characteristics 


While the two-step cascade composed of the TetR/Ptet and CI/PR modules exhibits
a useful inverse sigmoidal signal-response relationship, it is often the case that
coupling transcriptional regulatory modules does not yield the desired behavior.
Another version of the same network uses the LacI/Plac pair as the inducible module
to control the input to the CI/PR module. However, when initially assembled, no
fluorescence was observed from cells harboring the network regardless of whether the
inducer, in this case IPTG, was absent or present. Apparently, even with maximum
repression of the Plac promoter, CI is synthesized at a sufficiently high level to
fully repress transcription from the PR promoter. Unfortunately, our models are
presently not sufficiently accurate to predict such mismatch problems, partly
because accurate in vivo parameter values are difficult to obtain. Hence, it is often
necessary to first construct a network, and then use modeling tools to guide the
correction and fine-tuning of its behavior.

In order to overcome impedance mismatch problems, one can mutate genetic
elements until the desirable network response is obtained. Starting with a
non-functional or non-optimal network, such mutations can be introduced to affect
biological parameters identified by model analysis as most likely to yield the desired
behavior. For example, Feng et al. (2004) showed how to use global sensitivity
analysis to determine the best genetic targets for mutations that could make
the two-step LacI/Plac, CI/PR cascade functional. The steady state model in
equation 13.5 predicts, as shown in figure 13.3A, that decreasing the value of the
maximal repressor synthesis rate k1 or increasing the Hill constant K2 should restore
the desired response to a non-responsive network. The Hill constant
can be modified by mutating one of the CI-binding sites within the PR promoter to
lower the CI-binding affinity, and the maximal CI synthesis rate k1 can be changed
by mutating the ribosome-binding site (RBS) on the CI-encoding mRNA.
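
The mismatch and its predicted correction can be reproduced with the steady state model of the two-step cascade. The sketch below uses the parameter values quoted in the caption of figure 13.3; the decay constants are interpreted as first-order rate constants (in min^-1), the inducer is expressed in units of K1 (K1 = 1), and the function names and the "saturating" inducer level are assumptions made for illustration.

```python
def cascade_output(inducer, k1, K2, k2=20.0, a1=0.02, a2=0.02,
                   d2=0.1, d=0.1, K1=1.0, n1=2.0, n2=2.0):
    """Steady-state output [Z]ss of the two-step repressor cascade."""
    x = (inducer / K1) ** n1
    r2_ss = (k1 / d2) * (a1 + x / (1.0 + x))                    # repressor R2
    return (k2 / d) * (a2 + 1.0 / (1.0 + (r2_ss / K2) ** n2))   # reporter Z


def fold_change(k1, K2, saturating=1000.0):
    """Ratio of the output without inducer to the output at high inducer."""
    return cascade_output(0.0, k1, K2) / cascade_output(saturating, k1, K2)


if __name__ == "__main__":
    designs = {
        "original (k1 = 1000, K2 = 10)": (1000.0, 10.0),
        "weaker RBS (k1 = 100, K2 = 10)": (100.0, 10.0),
        "weaker operator (k1 = 1000, K2 = 100)": (1000.0, 100.0),
        "both (k1 = 100, K2 = 100)": (100.0, 100.0),
    }
    for label, (k1, K2) in designs.items():
        print(f"{label:40s} fold change = {fold_change(k1, K2):6.2f}")
```

In this sketch the original design is nearly non-responsive because the basal repressor level a1·k1/d2 already exceeds K2 by a wide margin, so the output promoter is almost fully repressed even without inducer. Lowering k1 or raising K2 brings the basal repressor level closer to K2 and restores a measurable response.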

That the model correctly predicts the genetic mutations required to obtain a
functional network is shown in figure 13.3B. The experimental results are obtained
with three different cI RBS sequences yielding lower translation efficiencies than
the original RBS (Weiss and Basu, 2002). The plots show that the systems with
the weakened RBS are able to respond to induction with IPTG, in agreement with
the model predictions. Also shown are the effects of introducing mutations into the
CI-binding site within the PR promoter. These mutations are combined with the
weakest RBS in order to optimize the response.


13.4.4 Interfacing Transcriptional Modules 


Once the kinetic characteristics of the individual transcriptional regulatory modules 
are appropriately matched, they can be coupled together into larger networks. 
This can be accomplished by combining modules at random (Guet et al., 2002) 
or rationally to achieve a specific network property. Perhaps the simplest extension 
of the two-step cascade is to add an additional repressor module to form a linear 
three-step network (figure 13.4A). The experimental investigation of this cascade 
highlights interesting properties that are important for the understanding of the 
more complex systems discussed in the sections to follow. 

An implementation of the three-step linear repressor cascade uses the TetR/PLtetO
module as the inducible input component and the LacI/Plac and CI/PR modules
as the first and second repressor module, respectively (Hooshangi et al., 2005).
Figure 13.4B shows the experimentally measured population-averaged steady state
network outputs at varying concentrations of the aTc inducer when a fluorescent
reporter is expressed from the PLtetO (P1), the Plac (P2), and the PR (P3)
promoter, respectively. These population-averaged protein abundance curves have the
correlations expected for the network. While the expression from the P1 and P3
promoters shows a positive correlation with the input aTc concentration (that is,
high-pass detection), expression from P2 shows a negative correlation (that is,
low-pass detection). When fitted to a Hill function, the Hill coefficients for the
steady state response in the cascades of length one, two, and three are 2.3, 7.0,
and 7.5, respectively. In other words, increased length of transcriptional regulatory
cascades improves the sensitivity to the input signal by enabling more pronounced
all-or-nothing steady state responses. This phenomenon can also be found
in naturally occurring regulatory motifs such as signal transduction phosphorylation
cascades (Ferrell Jr., 1996).

Figure 13.4 A. Architecture of the three-step repressor cascade. B. Population-averaged
steady state expression levels obtained by expressing a fluorescence reporter gene at
different steps in the cascade (P1 = PLtetO, P2 = Plac, P3 = PR) when the concentration of the
inducer of the first transcriptional module (aTc) is varied. C. Time course of population-
averaged expression levels at the different steps in the cascade following induction. D.
Relative population heterogeneity (standard deviation over the mean) at steps one and
three in the cascade following induction. Based on experimental data from (Hooshangi
et al., 2005).
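
The increase in apparent cooperativity with cascade length can be illustrated with a toy calculation. The sketch below chains identical Hill functions and estimates an effective Hill coefficient from the inputs giving 10% and 90% of the full transition, n_eff = ln(81)/ln(S90/S10), a standard measure that returns n exactly for a single Hill function. The dimensionless parameters (n = 2, maximal level 10 in units of the next Hill constant, no basal expression) are assumptions and are not fitted to the data of Hooshangi et al. (2005).

```python
import math


def cascade_response(s, length, n=2.0, amp=10.0):
    """Steady-state output of a cascade with `length` regulatory steps.

    Step 1 is activation by the scaled inducer s; every later step is
    repression by the previous output. Each output is measured in units of
    the next step's Hill constant and has maximal level `amp` (assumed).
    """
    y = amp * s ** n / (1.0 + s ** n)
    for _ in range(length - 1):
        y = amp / (1.0 + y ** n)
    return y


def effective_hill(length, s_min=1e-6, s_max=1e6, **kwargs):
    """Estimate n_eff = ln(81) / ln(S90 / S10) for the cascade response."""
    y_start = cascade_response(s_min, length, **kwargs)
    y_end = cascade_response(s_max, length, **kwargs)

    def crossing(p):
        # Bisection in log space: the completed fraction of the transition
        # increases monotonically with s for both rising and falling outputs.
        lo, hi = math.log(s_min), math.log(s_max)
        for _ in range(200):
            mid = 0.5 * (lo + hi)
            y = cascade_response(math.exp(mid), length, **kwargs)
            if (y - y_start) / (y_end - y_start) < p:
                lo = mid
            else:
                hi = mid
        return math.exp(0.5 * (lo + hi))

    return math.log(81.0) / math.log(crossing(0.9) / crossing(0.1))


if __name__ == "__main__":
    for length in (1, 2, 3):
        print(f"cascade length {length}: n_eff = {effective_hill(length):.2f}")
```

The qualitative trend, a steeper response for longer cascades, matches the experimental observation, although the specific values depend on the assumed parameters.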

It is also interesting to compare the time course of expression at the
different steps in the cascade following aTc induction. This is done in figure 13.4C.
While protein synthesis from the first promoter begins immediately after addition
of aTc, there is a significant time lag in the repression and activation of the second
and third promoter, respectively. The abundance of the protein expressed from
PLtetO (P1) reaches 50% of its maximal abundance after ~15 minutes, whereas it takes
about 200 and 300 minutes for the proteins expressed from the Plac (P2) and the
PR (P3) promoters, respectively, to pass the 50% mark. While a model based on
equation 13.1 predicts such delays, the experiments give an idea of the relative time
scale involved in transcriptional regulation and the response delay introduced as the
regulatory signal propagates through the network. Specifically, the cell division time
for E. coli is typically ~45-120 minutes depending on the strain and the growth
conditions, meaning it may take several generations for a full transcriptional response
to be realized.

Another important observation that can be deduced from the time series experiment
is that the regulatory signal propagates through the cascade at very different
rates in individual cells. Figure 13.4D compares how the relative variability in
fluorescence among cells, measured as the standard deviation over the mean, changes
following induction in the cascades of length one and three. While the cell-to-cell
variability changes little as time progresses for the one-step cascade, indicative of
a fairly homogeneous response, it changes significantly for the three-step cascade
and reaches a peak value after about 200 minutes. At this time, which roughly
corresponds to the point where expression from the PR promoter is initiated, the
cell population is highly heterogeneous. Hence, the increased steady state sensitivity
in the longer cascade comes at the cost of a response that initially is highly
asynchronous. An implication of this in terms of regulatory robustness is discussed
further in section 13.6 in the context of feedback networks.





13.5 Feedforward Networks 


Genetic feedforward networks are circuits in which transcriptional regulatory modules
are configured with a common input that propagates through parallel cascades
that ultimately converge to regulate a shared downstream promoter. Several endogenous
feedforward motifs have been documented (Lee et al., 2002) and three-gene
networks with this architecture appear more frequently in cellular regulation than 
expected based on randomized networks (Shen-Orr et al., 2002). Modeling predicts 
that the three-gene feedforward networks support a variety of properties ranging 
from transcriptional response delay and filtering to the generation of transient pulses 
of gene expression (Mangan and Alon, 2003). Here, we limit our discussion to feed- 
forward networks engineered in E. coli by interconnecting transcriptional regulatory 
modules in table 13.1. The first feedforward network (section 13.5.1) is composed of 
three genes and is designed to generate a transient pulse in response to a persistent 
inducing signal. The second network is composed of five genes and enables cells to 
respond to an inducing signal when the inducer concentration is within a specific 
range (section 13.5.2). 


13.5.1 Pulse-Generating Network 


When the downstream promoter in a feedforward network receives both an activating
and a repressing signal, a transcriptional pulse can be generated if the repressing
signal is delayed compared to the activating signal. Such a delay is realized if the
repressing signal has to propagate through a higher number of transcriptional modules
than the activating signal (see figure 13.4C). Hence, the feedforward network
depicted in figure 13.5A should be able to generate a gene expression pulse. In
this network, an inducing input signal (S1) activates the transcription of a reporter
gene from a multi-input promoter (P12) as well as the expression, from the P1
promoter, of a repressor (R2) of the P12 promoter.

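Ignoring basal expression and modeling the expression from the P12 promoter as
a product of an activating and a repressing Hill function, the feedforward network
can be described by the following ODEs:

\[
\begin{aligned}
\frac{d[R_2]}{dt} &= k_1\,\frac{s^{n_1}}{1 + s^{n_1}} - d_2\,[R_2] \\
\frac{d[Z]}{dt} &= k_{12}\,\frac{s^{n_1}}{1 + s^{n_1}}\cdot\frac{1}{1 + ([R_2]/K_2)^{n_2}} - d\,[Z]
\end{aligned} \qquad (13.6)
\]
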
Figure 13.5 A. Architecture of the pulse-generating network. B-D. Simulations of the
network model demonstrating the effect of changing parameters at full induction
(H = 1) (panel B), the level of induction (panel C), and the rate of inducer accumulation
(panel D). Unless otherwise indicated, parameter values are (in nM/min): k1 = 5, k12 = 20;
(in min^-1): d2 = 0.01, d = 0.04; (in nM): K1 = 1, K2 = 100; and n1 = n2 = 3. Inducer
accumulation is modeled by setting s(t) = k_s·t with k_s being the rate of accumulation.


where s = S1/K1 is the inducing signal of P1 (that is, dependent on the inducer
concentration [I1]). As before, it is not necessary to include the concentration of
the R1 protein because its concentration can be assumed constant.

Without resorting to computer simulations, let us see if we can generate intuition
about the network dynamics directly from the ODEs. To do this, we define the
induction level H as H = s^n1/(1 + s^n1) and find the steady states of the system.
They are given by:

\[
\begin{aligned}
{}[R_2]_{ss} &= \frac{k_1}{d_2}\,H \\
{}[Z]_{ss} &= \frac{k_{12}}{d}\cdot\frac{K_2^{n_2}\,H}{K_2^{n_2} + [R_2]_{ss}^{n_2}}
\end{aligned} \qquad (13.7)
\]


Let us consider the case where the induction level is constant and the lifetime of
the repressor is so long that its decay can be assumed negligible. In this case, the
accumulation of repressor following induction at t = 0 is given by [R2](t) = k1·H·t.
This reduces equation 13.6 to a time-dependent ODE for the output concentration
that is given by:

\[
\frac{d[Z]}{dt} = \frac{k_{12}\,H}{1 + (k_1 H t/K_2)^{n_2}} - d\,[Z] \qquad (13.8)
\]


This equation captures the initial high transcription rate from P12, which leads to
an overshoot of the steady state in equation 13.7, and the decrease in this rate as
repressor accumulates with time. It also indicates that the duration of the pulse (and
hence its magnitude) is linked to the maximal rate of repressor synthesis k1, the
Hill constant K2, and the induction level H. Specifically, a 50%
decrease in transcription occurs after a time period given by t0.5 = K2/(k1·H).
Hence, increasing K2 or decreasing k1 is predicted to cause a longer pulse and
higher amplitude. This prediction is validated by the model simulations presented
in figure 13.5B.
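
These qualitative predictions can be checked numerically. The sketch below integrates the pulse-generator ODEs at a constant induction level H with a forward Euler scheme, using the default parameter values listed in the caption of figure 13.5 (decay constants interpreted as first-order rate constants in min^-1). The function names, the integrator, and the simulated time span are assumptions made for illustration; this is not the simulation code behind figure 13.5.

```python
def simulate_pulse(H=1.0, k1=5.0, k12=20.0, d2=0.01, d=0.04,
                   K2=100.0, n2=3.0, t_end=600.0, dt=0.01):
    """Integrate the pulse-generator ODEs from [R2] = [Z] = 0 at induction.

    Returns (peak, t_peak, final): the maximal output, the time at which it
    occurs (min), and the output at t_end.
    """
    R2 = Z = 0.0
    peak, t_peak = 0.0, 0.0
    for step in range(int(round(t_end / dt))):
        dR2 = k1 * H - d2 * R2
        dZ = k12 * H / (1.0 + (R2 / K2) ** n2) - d * Z
        R2 += dt * dR2
        Z += dt * dZ
        if Z > peak:
            peak, t_peak = Z, (step + 1) * dt
    return peak, t_peak, Z


def steady_state_output(H=1.0, k1=5.0, k12=20.0, d2=0.01, d=0.04,
                        K2=100.0, n2=3.0):
    """Steady-state output from the steady states of the two ODEs."""
    R2_ss = k1 * H / d2
    return (k12 / d) * H * K2 ** n2 / (K2 ** n2 + R2_ss ** n2)


def half_repression_time(H=1.0, k1=5.0, K2=100.0):
    """Time at which repressor accumulation halves transcription from P12."""
    return K2 / (k1 * H)


if __name__ == "__main__":
    for label, kwargs in [("default", {}), ("K2 = 200", {"K2": 200.0}),
                          ("k1 = 2.5", {"k1": 2.5}), ("H = 0.5", {"H": 0.5})]:
        peak, t_peak, final = simulate_pulse(**kwargs)
        print(f"{label:10s} peak = {peak:7.1f} nM at t = {t_peak:6.1f} min, "
              f"final = {final:6.2f} nM")
```

With the default values the output overshoots its steady state by more than an order of magnitude before relaxing, which is the pulse. Raising K2 or lowering k1 delays repression and raises the peak, while a lower induction level H lowers it.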

Suboptimal induction, that is, values of H less than one, is also predicted to 
increase the duration of the pulse. However, because the rate of output protein 
synthesis depends on the induction level, suboptimal induction is expected to 
decrease the amplitude. Additionally, because the length and the amplitude of the 
pulse depend on how fast the repressor accumulates, they should also depend on 
the rate at which the inducing signal accumulates. These predictions are validated 
by the model simulations presented in figure 13.5C and figure 13.5D, respectively. 

The pulse-generating network is implemented experimentally by expressing the
CI repressor from the AHL-activated Plux promoter and GFP from the multi-input
promoter designated PluxOR. This promoter is obtained by inserting a CI binding
site into the Plux promoter to achieve repression of AHL-activated transcription by
CI (Basu et al., 2004). The experimental observations reflect well the results of the
above analysis. Figure 13.6A shows population-averaged temporal responses of four
E. coli strains harboring networks constructed with different rates of CI synthesis
(cI-RBS mutations) and different binding affinities of CI to the PluxOR promoter
(operator mutations) following induction with saturating AHL concentrations. It is
seen that the effects of the mutations are in agreement with the model predictions.
Due to a high repressor synthesis rate (high k1) and strong repressor binding to the
PluxOR promoter (low K2), a pulse is not generated in the original network. Mutations
that decrease the repressor synthesis rate or the operator binding strength yield a
pulse with intermediate duration and amplitude. The best network performance is
obtained when these mutations are combined.

In a second set of experiments, the temporal response was measured after 
induction with different AHL concentrations using the network with the best 
performance. As shown in figure 13.6B, at AHL concentrations below 47 nM, the 
pulse amplitude is decreased and its duration shortened. At an AHL concentration 
of 4.7 nM, the pulse can hardly be observed. In other words, the system responds 
differently at nonsaturating AHL concentrations as predicted by the model analysis. 
A third set of experiments measured the network response to different rates of 
AHL accumulation. The results are shown in figure 13.6C. As the rate of AHL 
accumulation is decreased, the onset of the pulse is delayed, and its amplitude 
decreased. This is also in agreement with the model prediction. 

The experimental results in figure 13.6A-C are population-averaged responses
obtained in a well-mixed environment. This leaves open the question of how cells 
harboring the feedforward network respond in an environment where signal diffusion 
plays an important role. Figure 13.6D illustrates the results of an experiment 
designed to determine the spatio-temporal response at the level of single cells. 
In these experiments, cells harboring the pulse-generating feedforward network are 
placed adjacent to E. coli cells that synthesize and emit AHL. The AHL-emitting 
“sender” cells harbor an aTc-inducible promoter controlling the expression of the 
Figure 13.6 A-C. Experimental validations of the model predictions in a pulse-
generating network activated by AHL. The effect of (A) introducing mutations to change
the k1 and K2 parameters, (B) different inducer concentrations, and (C) changing the rate
of inducer accumulation. Based on data from (Basu et al., 2004). D. Responses of pulse-
generating cells placed at different distances from nearby sender cells to AHL synthesized
by the senders. Notice that the response of cells in position 2, which is farther away from
the senders, is delayed, and the maximum pulse amplitude is diminished.


enzyme (LuxI) that synthesizes AHL from common metabolites. As a result, the 
sender cells produce AHL when treated with aTc. The inducer subsequently diffuses 
into the environment and establishes an AHL concentration gradient. Figure 13.6D 
shows the phase-contrast and fluorescence microscopy images of “receiver” cells 
harboring the feedforward network taken at different time points and different 
distances from a colony of AHL-emitting senders. While there is distinct variability 
in the response from one cell to another, it is seen that single cells respond to 
increased AHL by generating a pulse of fluorescence. Moreover, AHL-induction 
elicits a response in receiver cells that depends on the distance from the senders. 
Because of AHL diffusion, the rate of AHL accumulation is slower farther from 
the AHL-emitting source. This allows receiver cells to differentiate between signals 
originating from nearby and distant senders. 


13.5.2 Concentration Band Detection 


The pulse-generating system discussed in the previous section is an example of the 
complex responses that can be generated from transcriptional networks combining 
one-step and two-step linear cascades in a feedforward architecture. In this sec- 
tion, we investigate a network in which a two-step and a three-step cascade are 
Figure 13.7 Architecture of the band detection feedforward network obtained by
combining a low-pass two-step linear repressor cascade and a high-pass three-step linear
repressor cascade. This combination enables the synthesis of the reporter protein only when
the inducer I1 is within a specific range.


activated by the same input and regulate the same output. The resultant five-gene 
feedforward system activates the expression of the output gene within a finite con- 
centration range of the inducing signal, that is, concentration band detection, and 
supports the formation of spatial patterns in response to a gradient in the inducing 
signal. These experiments demonstrate a mechanism referred to as the “French flag 
model” in developmental biology (Wolpert, 2002) where cells read and respond to 
spatial information encoded in a “morphogen” gradient by having sharp induction 
thresholds. 

Figure 13.7 illustrates the schematics of the five-gene feedforward network. Along
the three-step branch, the inducing signal S1, which is generated by a combination
of the regulator R1 and its inducer I1, activates the expression of R2, which,
in turn, inhibits the expression of the R3 repressor. In the final step, the repressor
R3 inhibits the transcription of the reporter protein. Along the two-step branch,
the inducing signal activates the expression of the repressor designated R3', which
is functionally equivalent to the R3 repressor (that is, it also inhibits the expression
of the reporter protein).

How will the system respond to different levels of the inducing signal? To answer 
this question, we look at the steady state concentration of the output reporter 
protein. Since R3 and R3′ are assumed to be functionally identical, it is given by:

where k1′ is the rate of R3′ expression from the P1′ promoter, and the inducing signal
S1 is expressed relative to the value that yields 50% response (s = S1/K1).

Along the two-step branch, the steady state concentration of R3′ rises as the
inducing signal increases and the expression of the output protein is inhibited when
the inducing signal is high. This branch acts as a low-pass detector (see figure 13.4).
The steady state concentration of R3 in the three-step branch shows the opposite
correlation and acts as a high-pass detector. For appropriately matched parameter
values, this may leave a gap in the concentration of the repressors of transcription
from P3 at intermediate values of the inducing signal. It is in this gap that the
output protein is synthesized.

The boundaries in the concentration of the inducing signal between which the
output is expressed can be obtained from the steady states in equation 13.10 as the
values S_low and S_high where R3 and R3′ have 50% of their maximal concentrations,
respectively. They are given by:

\[
S_{\mathrm{low}} = K_1 \left( \frac{d_2 K_2}{k_1 - d_2 K_2} \right)^{1/n},
\qquad
S_{\mathrm{high}} = K_1
\tag{13.11}
\]


Therefore, a gap in the total concentration of repressors may occur if k1 > 2 K2 d2.
If a sufficient gap exists, the range of band detection can be shifted by modifying
the value of K1.
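
The band boundaries and the gap condition can be checked numerically. The following Python sketch is illustrative only. It assumes that R2 follows a Hill-type steady state, [R2] = (k1/d2) s^n/(1 + s^n) with s = S1/K1, that R3 is half-repressed when [R2] = K2, and that the two-step branch is half-induced at S1 = K1. The function names and the Hill coefficient n are assumptions introduced for this example and are not taken from the text.

```python
def r2_steady_state(s1, k1, d2, K1, n):
    """Assumed Hill-type steady state of R2 for an inducing signal s1."""
    s = (s1 / K1) ** n
    return (k1 / d2) * s / (1.0 + s)


def band_boundaries(k1, d2, K1, K2, n):
    """Return (s_low, s_high) under the assumptions stated above.

    s_low  : signal at which [R2] reaches K2, so that R3 is half-repressed.
    s_high : signal at which the two-step branch repressor is half-induced.
    Returns None if R2 can never reach K2 (k1 <= d2 * K2).
    """
    if k1 <= d2 * K2:
        return None
    s_low = K1 * (d2 * K2 / (k1 - d2 * K2)) ** (1.0 / n)
    s_high = K1
    return s_low, s_high


def has_gap(k1, d2, K2):
    """Gap condition: the lower boundary lies below the upper boundary."""
    return k1 > 2.0 * K2 * d2


if __name__ == "__main__":
    # Scan a few expression rates; only sufficiently strong R2 expression opens a band.
    for k1 in (1.5, 3.0, 10.0):
        print(k1, band_boundaries(k1, d2=1.0, K1=1.0, K2=1.0, n=2), has_gap(k1, 1.0, 1.0))
```

Under these assumptions both boundaries scale with K1, which is why changing K1 shifts the detected band without opening or closing the gap.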

The five-gene feedforward network is implemented experimentally (Basu et al., 
2004) using the AHL-activated LuxR/Plux module to regulate the expression of the
CI protein (R2), which, in turn, regulates the expression of the LacI protein (R3)
from the PR (P2) promoter. The system output protein is expressed from the Plac
(P3) promoter. These transcriptional regulatory modules comprise the three-step
branch of the network. Along the two-step branch, AHL activates transcription of
a variant of the lacI gene, designated lacIM1, that differs in its DNA sequence from
that of the lacI gene, but encodes a protein with the same amino-acid sequence.
The protein product encoded by lacIM1 (R3′) is thus functionally identical to LacI.

The band detection network is implemented in different versions: one using 
the wild-type LuxR protein, designated BD2, the other, designated BD1, with a 
mutant variant of LuxR that is hypersensitive to AHL. In the latter, less AHL is 
required to achieve the same expression level from the Plux promoter, corresponding
to a decreased value of the Hill constant K1. Accordingly, the range of AHL
concentrations detected by the two versions should be different, with the mutated 
LuxR network expressing the output protein of the system at a lower inducer 
concentration. Figure 13.8A and figure 13.8B show that this differential response 
is also observed experimentally. In figure 13.8A, the measured steady state input- 
response of the two-step branch is shown at varying AHL concentrations. It is seen 
that the AHL concentration yielding 50% response is decreased more than 10-fold 
when the hypersensitive LuxR variant is employed. In figure 13.8B, the observed 
steady state input-response of the five-gene band detection network is shown at varying AHL
concentrations. As predicted by the model, BD1 cells activate the expression of the 
system output at a lower range of AHL concentration than BD2 cells. 

The different ranges of AHL detected by strains harboring the different band 
detection networks enable multicellular pattern formation. Figure 13.8C shows the
formation of a target pattern in an experiment where an AHL-emitting cell strain is
grown at the center of a lawn containing a mixture of BD1 and BD2 cells. The BD2
cells turn on the expression of a fluorescent reporter gene at a short distance from
the AHL-emitting cells, but remain quiescent farther away. On the other hand, the
BD1 cells are quiescent near the center of the lawn and express a differently colored
fluorescent reporter only at a distance from the AHL-emitting cells.

Figure 13.8 A. Population-averaged signal-response curves for cells harboring two
versions of the two-step branches with a wildtype and a hypersensitive LuxR mutant,
respectively. B. Signal-response curves for the five-gene feedforward network with the
wildtype LuxR (strain BD2) and the mutant LuxR (strain BD1). C. Formation of a
target pattern within a bacterial lawn containing a mixture of the BD1 and BD2 strains
in the presence of AHL-emitting cells in the center of the lawn. Data from (Basu et al.,
2004).





13.6 Feedback Networks 


The experiments involving feedforward networks demonstrate how complex dynam- 
ics can be generated by combining linear signaling cascades. In natural regulatory 
systems, such behaviors are frequently generated in networks incorporating feed- 
back loops as an additional control feature. As discussed in chapter 6, feedback 
control enables complex dynamics, such as bistability, hysteresis, and oscillations. 
There are numerous examples of such behaviors in natural genetic circuits. For
example, LacI is a key component of a natural genetic feedback network that ex- 
hibits bistability (Ozbudak et al., 2004). There are also many examples of gene 
regulatory feedback networks supporting dampened or sustained oscillations. They 
include, for example, the circadian clocks discussed in chapter 2 and chapter 12, 
and the Mdm2-p53 network discussed in chapter 6. A motivation for the implemen- 
tation of synthetic gene regulatory feedback systems is to complement the analysis 
of mathematical models of natural circuits with investigations of how feedback net- 
works behave in vivo. In this section, we discuss several genetic feedback networks 
implemented in E. coli to create cells capable of complex temporal dynamics and
the mathematical models used to design or to understand the network properties. 


13.6.1 Bistable Networks 


Bistability and hysteresis are trademark features of networks that contain positive 
feedback or autocatalysis. Here, we investigate two single-gene positive feedback 
networks giving rise to hysteresis (Atkinson et al., 2003) and bimodal population 
distributions (Isaacs et al., 2003), respectively, and a two-gene system designed 
to operate as a bistable genetic toggle switch (Gardner et al., 2000; Kobayashi 
et al., 2004). In the single-gene positive feedback system depicted in figure 13.9A, 
a transcription activator R1 binds to its own promoter and increases the rate of its
own synthesis. This network can be described by the ODE:


\[
\frac{d[R_1]}{dt} = \cdots - d_1 \cdot [R_1]
\tag{13.12}
\]


where the parameter γ is a measure of the feedback control strength. Because at this
point we are interested in using the model to reveal general trends, it is useful for
the analysis to introduce new dependent variables to reduce the number of unknown
parameters. For equation 13.12, a useful normalization is to use the dimensionless
concentration r1 defined by r1 = [R1]/K1 and dimensionless time τ defined by
τ = d1 · t. This corresponds to expressing the protein concentration relative to
that yielding 50% response and time relative to the protein lifetime, regardless of
the actual value of these parameters. Using the chain rule, the normalized form of
equation 13.12 is obtained as

\[
\frac{dr_1}{d\tau} = \cdots - r_1
\tag{13.13}
\]


where κ1 is defined by κ1 = k1/(K1 d1). Similarly normalized equations will be used
in the remaining sections of this chapter. 
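
The chain-rule step can be written out for a generic production term. The sketch below assumes only that the rate law has the form "production minus first-order decay", with f standing for an unspecified dimensionless regulatory function of [R1]/K1. The symbol f is introduced here for illustration and is not part of the original equations.

```latex
\begin{aligned}
\frac{d[R_1]}{dt} &= k_1\, f\!\left(\frac{[R_1]}{K_1}\right) - d_1 [R_1],
\qquad r_1 = \frac{[R_1]}{K_1}, \quad \tau = d_1 t, \\[4pt]
\frac{dr_1}{d\tau}
  &= \frac{dr_1}{d[R_1]} \cdot \frac{d[R_1]}{dt} \cdot \frac{dt}{d\tau}
   = \frac{1}{K_1} \Bigl( k_1 f(r_1) - d_1 K_1 r_1 \Bigr) \frac{1}{d_1}
   = \kappa_1 f(r_1) - r_1,
\qquad \kappa_1 = \frac{k_1}{K_1 d_1}.
\end{aligned}
```

The three dimensional parameters k1, K1, and d1 collapse into the single dimensionless group κ1, which is the reduction in unknown parameters that the normalization is meant to achieve.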

Figure 13.9B shows a bifurcation diagram obtained by plotting the steady state 
solutions of equation (13.13) as a function of the feedback control strength. The 
steady state curve has the “S”-shape characteristic of bistable systems. At low
feedback strength, there is little or no activation, and expression occurs essentially at
basal levels. At high feedback strength, the promoter is more or less fully activated
and expression takes place at a rate close to maximal.

Figure 13.9 Bistability in single-gene autocatalytic networks. A. Network architecture.
B. Bifurcation diagrams for the network model. Full and broken curves indicate stable
and unstable steady states, respectively. The saddle-node (SN) bifurcations are located
where the stable and unstable steady states collide. Parameter values are: α = 0.1, κ = 5,
and n = 3. C. Bistability and hysteresis observed in a NtrC positive feedback network
when the strength of NtrC-activated transcription is varied with IPTG. Closed and open
squares correspond to cells initially in the low and high states, respectively. Based on
data from (Atkinson et al., 2003). D. Transitions between uni- and bimodal population
distributions observed in a CI/PRM feedback network corresponding to different strengths
of the feedback loop. Based on data from (Isaacs et al., 2003).

These two states co-exist at intermediate values of the feedback control strength
parameter, with the region of bistability demarcated by two saddle-node bifurcations
located at values of γ slightly above 0.3 and just shy of 0.8.
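
How the number of steady states changes with the feedback strength can be illustrated with a short Python sketch. It assumes a Hill-type rate law, dr/dτ = κ(α + (γr)^n)/(1 + (γr)^n) − r, which is one plausible form of a single-gene autocatalytic model and is not necessarily the exact equation used for the figure. The saddle-node locations it produces therefore need not coincide with those quoted above. All function names are hypothetical.

```python
def rate(r, gamma, kappa=5.0, alpha=0.1, n=3):
    """Assumed dimensionless rate law: Hill-type autoactivation minus decay."""
    x = (gamma * r) ** n
    return kappa * (alpha + x) / (1.0 + x) - r


def steady_states(gamma, r_max=6.0, steps=6000):
    """Find steady states by scanning rate() for sign changes, then bisecting.

    r_max is chosen above the default kappa, since production never exceeds kappa.
    """
    roots = []
    dr = r_max / steps
    lo, f_lo = 0.0, rate(0.0, gamma)
    for i in range(1, steps + 1):
        hi = i * dr
        f_hi = rate(hi, gamma)
        if f_lo * f_hi < 0.0:
            a, b, fa = lo, hi, f_lo
            for _ in range(50):  # bisection refinement of the bracketed root
                mid = 0.5 * (a + b)
                f_mid = rate(mid, gamma)
                if fa * f_mid <= 0.0:
                    b = mid
                else:
                    a, fa = mid, f_mid
            roots.append(0.5 * (a + b))
        lo, f_lo = hi, f_hi
    return roots


if __name__ == "__main__":
    for gamma in (0.2, 0.45, 0.7):
        print(gamma, [round(r, 3) for r in steady_states(gamma)])
```

In this sketch a single low state exists at weak feedback, a single high state at strong feedback, and three steady states (two stable, one unstable) in between, which is the S-shaped structure described in the text.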

Several synthetic single-gene autocatalytic gene networks have been constructed 
with the purpose of generating bistability (Becskei et al., 2001; Isaacs et al., 2003; 
Atkinson et al., 2003). One system (Atkinson et al., 2003) is constructed such 
that the transcription factor NtrC activates its own expression from a modified 
NtrC-responsive PglnA promoter and that of a reporter gene from the promoter of
the glnK gene (PglnK). The modified PglnA promoter is engineered such that the
ability of NtrC to activate transcription is attenuated by LacI. This is achieved by 
inserting LacI binding sites such that the repressor competes with the activator 
for promoter access. This allows for an indirect means of modulating the feedback 
control strength using IPTG. In cells that express LacI at high levels, increasing 
the IPTG concentration enables more efficient activation by NtrC of transcription 
from the modified PglnA promoter.

Figure 13.9C shows the experimentally observed effect of varying the feedback
control strength on the population-averaged expression of cells harboring the
IPTG-sensitive NtrC positive feedback system. The observations are in excellent
agreement with the model predictions. When grown in the absence of IPTG (that is, low feedback
strength), cells express the reporter protein at low levels because of efficient re- 
pression by LacI. When these cells are exposed to increased inducer concentrations 
(closed squares in figure 13.9C), the measured signal-response curve is fairly flat 
below a critical IPTG concentration where the reporter expression changes sharply 
from low to high levels. On the other hand, when cells initially grown with high 
IPTG concentrations and fully activated (open squares) are exposed to decreased 
concentrations of IPTG, expression levels remain high until a critical concentration 
where a sharp transition to low expression is observed. At identical intermediate 
IPTG concentrations, corresponding to intermediate strength of the feedback con- 
trol, cell populations adopt a high or a low expression state depending on the initial 
conditions. Hence, the network endows cells with the ability to support bistability 
and hysteresis. 
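
The dependence on initial conditions can be reproduced in a simple simulation. The Python sketch below sweeps the feedback strength up and then down while carrying the state from one value to the next, in the spirit of the two experimental series. It uses an assumed Hill-type rate law with illustrative parameters (κ = 5, α = 0.1, n = 3) and forward-Euler integration. Neither the rate law nor the function names are taken from Atkinson et al. (2003).

```python
def rate(r, gamma, kappa=5.0, alpha=0.1, n=3):
    """Assumed dimensionless rate law: Hill-type autoactivation minus decay."""
    x = (gamma * r) ** n
    return kappa * (alpha + x) / (1.0 + x) - r


def relax(r, gamma, t_end=50.0, dt=0.01):
    """Integrate with forward Euler for t_end time units; return the final value."""
    for _ in range(int(t_end / dt)):
        r += dt * rate(r, gamma)
    return r


def sweep(gammas, r_start):
    """Step through feedback strengths, carrying the state from one step to the next."""
    r, result = r_start, {}
    for gamma in gammas:
        r = relax(r, gamma)
        result[gamma] = r
    return result


grid = [i / 100 for i in range(20, 71)]            # feedback strengths 0.20 ... 0.70
up = sweep(grid, r_start=0.0)                      # cells starting in the low state
down = sweep(list(reversed(grid)), r_start=5.0)    # cells starting in the high state

if __name__ == "__main__":
    for gamma in (0.20, 0.45, 0.70):
        print(gamma, round(up[gamma], 3), round(down[gamma], 3))
```

At intermediate feedback strength the two sweeps settle on different branches, whereas they coincide at the extremes. This is the signature of hysteresis in the assumed model.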

Another single-gene positive feedback system engineered to display bistability 
(Isaacs et al., 2003) employs the CI-activated PRM promoter to control the
expression of a mutated cI gene (designated cI857) encoding a temperature-sensitive
variant of the CI protein. The CI protein also activates the transcription of a
GFP-encoding gene allowing the measurement of gene expression at the level of single
cells. The temperature-dependent activity of the cI857-encoded CI protein enables
modulation of the feedback strength through temperature variation. The activity 
of the CI variant decreases with increased temperature. Hence, a low temperature 
corresponds to a high feedback strength and high temperature corresponds to a 
low value of this parameter. The model thus predicts a low expression state at 
high temperature, a high expression state at low temperature, and bistability at 
intermediate temperatures. Figure 13.9D illustrates the population-distribution of 
fluorescence from cells harboring the CI/PRM feedback network at three different
temperatures. For low and high temperatures, the population distributions contain 
a single peak and cells are in a high state when they are grown at low temperature 
and in a low expression state when they are grown at high temperature,
respectively. At the intermediate temperature, the population distribution is bimodal,
indicating that noise frequently drives cells between the low and the high expression
states. A detailed model of the circuit where the
deterministic equations are augmented with stochastic terms accounts well for the 
observed distributions (Isaacs et al., 2003). 

The toggle switch network, which is illustrated in figure 13.10A, is an example of 
a two-gene system designed and implemented to allow E. coli cells to be switched 
between two distinct expression states in response to external stimuli. This system, 
which represents a multi-component motif (Lee et al., 2002) with indirect positive 
feedback, is composed of two genes encoding transcription factor proteins, R1
and R2, that inhibit each other’s expression. Because of this mutual repression,
the network can be either in a state with high R1 expression and repressed R2
transcription, or in a state with high R2 expression and repressed R1 transcription.

Figure 13.10 The bistable toggle switch. A. Architecture of the network. B.
Two-dimensional bifurcation diagrams indicating the boundaries between bistable and
monostable regions in the κ1, κ2 parameter space at different values of the Hill coefficients.
C–D. Transitions between high and low expression states in the pTAK and the pIKE
toggle switch networks, respectively. The networks, pTAKΔcI, pTAKΔlacI, pIKEΔtetR,
and pTAKΔlacI, are controls in which one of the repressor genes is eliminated. Based on
data from (Gardner et al., 2000).


The experimental implementation of the toggle switch network is guided by the 
analysis of the dimensionless ODEs (Gardner et al., 2000): 


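\[
\frac{dr_1}{d\tau} = \frac{\kappa_2}{1 + r_2^{\,n_2}} - r_1,
\qquad
\frac{dr_2}{d\tau} = \frac{\kappa_1}{1 + r_1^{\,n_1}} - r_2
\tag{13.14}
\]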





where the repressor concentrations are expressed relative to the appropriate Hill 
constant and time relative to the protein lifetime (which is assumed to be the 
same for the two repressors). Conditions that make bistability more likely are 
high maximal expression levels (that is, high values of κ1 and κ2) and high Hill
coefficients. This can be seen from the bifurcation diagrams in figure 13.10B, which
show the location of saddle-node bifurcations, that is, the boundaries between
mono- and bistability, in the κ1, κ2 parameter plane for different values of the Hill
coefficients. Increased values of the Hill coefficients enlarge the bistable region in
the κ1, κ2 parameter space, and an increased value of one of the maximal expression
rates allows for bistability in a wider range of values of the other. 
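
The role of the maximal expression rates can be illustrated by integrating the dimensionless toggle equations from two different initial conditions. The Python sketch below assumes the mutual-repression form dr1/dτ = κ2/(1 + r2^n2) − r1 and dr2/dτ = κ1/(1 + r1^n1) − r2. The parameter values, the forward-Euler scheme, and the function names are choices made for this example and are not those of Gardner et al. (2000).

```python
def toggle_rhs(r1, r2, kappa1, kappa2, n1, n2):
    """Right-hand side of the assumed mutual-repression model."""
    dr1 = kappa2 / (1.0 + r2 ** n2) - r1   # R1 is repressed by R2
    dr2 = kappa1 / (1.0 + r1 ** n1) - r2   # R2 is repressed by R1
    return dr1, dr2


def simulate(r1, r2, kappa1=10.0, kappa2=10.0, n1=2, n2=2, t_end=50.0, dt=0.01):
    """Forward-Euler integration; returns the state (r1, r2) at t_end."""
    for _ in range(int(t_end / dt)):
        dr1, dr2 = toggle_rhs(r1, r2, kappa1, kappa2, n1, n2)
        r1 += dt * dr1
        r2 += dt * dr2
    return r1, r2


if __name__ == "__main__":
    # Strong expression: the final state depends on where the system starts.
    print(simulate(5.0, 0.1), simulate(0.1, 5.0))
    # Weak expression: both initial conditions end in the same state.
    print(simulate(5.0, 0.1, kappa1=1.0, kappa2=1.0),
          simulate(0.1, 5.0, kappa1=1.0, kappa2=1.0))
```

With strong maximal expression the two runs end in opposite states, whereas with weak expression they converge to a single state, in line with the trend described above.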

The toggle switch network is implemented experimentally using different
transcriptional regulatory modules. Gardner et al. (2000) constructed two versions; one
employs the LacI/Ptrc and the TetR/PLtetO modules and is designated as pIKE.
The other employs the LacI/Ptrc and the temperature-sensitive CI/PL modules and
is designated as pTAK. The expression state is monitored by co-expressing GFP
with the cI857 gene (pTAK) or the tetR gene (pIKE). Two variants of the pIKE
network with strong and weak lacI-RBS sequences demonstrate that the maximal
rate of expression, in agreement with model predictions, is an important parameter
for the emergence of bistability. Figure 13.10D shows the effect of treating cells in
the high LacI state with IPTG to activate TetR expression. Both variants respond
by expressing the reporter protein. When IPTG is removed, cells harboring the
network with the weaker lacI-RBS maintain the high expression state while those
harboring the variant with the stronger lacI-RBS revert to the low expression state.
The cells that remain in the high TetR state require addition of aTc to reactivate
LacI expression. Hence, only the network with the weak lacI-RBS supports
bistability. Four variants of the pTAK network, also with different RBS sequences, all
exhibit bistability. This is shown in figure 13.10C. Addition of IPTG to inhibit LacI
induces a transition to a high expression state (low LacI/high CI), and the latter is
maintained when IPTG inducer is removed. A subsequently applied transient
temperature increase (to deactivate CI) induces a transition back to the low expression
state (high LacI/low CI).

Figure 13.11 Extending the toggle switch network. A. The pTAK toggle switch is
augmented with the AHL-activated LuxR/Plux module to convert the network into an AHL
sensor. B. Flipping between stable expression states by application of AHL to activate LacI
expression or IPTG to activate CI expression. C. Hysteresis is observed when cells initially
in the low (open circles) or high (closed circles) expression states are exposed to AHL at
varying concentrations. Based on data from (Kobayashi et al., 2004).

In an extension to the toggle switch (Kobayashi et al., 2004), mathematical
modeling is used to guide the experimental implementation of “programmable” cells 
in which the pTAK toggle responds to signals from other gene regulatory networks. 
In one of the implementations, the LuxR/Plux module is used to drive additional
synthesis of LacI. The resultant five-gene network is depicted in figure 13.11A. 
In this system, the toggle switch can be flipped back and forth by adding AHL 
to increase the LacI synthesis rate and IPTG to increase the CI synthesis rate. 
This is illustrated in figure 13.11B, which shows population-averaged expression 
levels following induction with AHL and IPTG. Cells initially switch to the high 
expression state (high CI) following IPTG treatment and remain in this state when 
IPTG is removed. A transition to the low expression state (high LacI) occurs when 
these cells subsequently are treated with AHL, and the low state is maintained 
when AHL is removed. These cells are still responsive, and a second treatment with 
IPTG induces a transition to the high expression state. 

The system in figure 13.11A also supports hysteresis. Figure 13.11C shows the
result of an experiment where cells that were initially prepared in the high CI
state (high fluorescence) or the high LacI state (low fluorescence), respectively, are
exposed to AHL at varying concentrations. The high LacI state is, as expected, 
unaffected by AHL, and cells remain in the low fluorescence state regardless of the 
AHL concentration. However, the high CI state is sensitive to AHL. At inducer 
concentrations less than 20 nM, cells remain in the high expression state. On 
the other hand, at inducer concentrations higher than 40 nM, all the cells have 
switched to the low expression state. At intermediate inducer concentrations, the 
cell population contains a mixture of cells in high and low expression states. This 
bimodal response presumably arises from a combination of differences in induction 
threshold and noise-induced transitions, which are more likely to occur when the 
system is closer to the saddle-node bifurcation. 


13.6.2 Oscillatory Networks 


The discussion in the previous section provides examples of two complex properties, 
bistability and hysteresis, supported by genetic networks incorporating feedback 
regulation. Other complex behaviors that arise in feedback control systems are 
dampened and sustained oscillations. An example of a synthetic gene regulatory 
system capable of generating oscillations is obtained by adding a negative feedback 
to a three-step linear repressor cascade. The resultant system, which is referred 
to as the Repressilator (Elowitz and Leibler, 2000), is illustrated schematically in 
figure 13.12A. In the network, repressor R1 inhibits the expression of repressor R2,
repressor R2 inhibits the expression of repressor R3, and repressor R3 inhibits the
expression of repressor R1.

The implementation of the Repressilator network is based on the analysis of 
a model describing the dynamics of repressor mRNA and protein concentrations. 
The concentration of mRNA is included explicitly because the separation of tran- 
scription and translation contributes to a response delay that is important for the 
emergence of oscillations. Hence, equation 13.3 is used as the basis of the network
model rather than equation 13.1, which, as described in section 13.3, assumes that 
the mRNA concentration is always in a steady state. To ease the analysis, it is 
assumed that each transcriptional module is characterized by the same set of pa- 
rameters and that the rate constant associated with translation is equal to that 
associated with repressor decay. With these assumptions, the network is described 
by the following dimensionless equations (Elowitz and Leibler, 2000): 
\[
\frac{dm_i}{d\tau} = \alpha\kappa + \frac{\kappa}{1 + r_j^{\,n}} - m_i,
\qquad
\frac{dr_i}{d\tau} = \varepsilon\,(m_i - r_i)
\tag{13.15}
\]





where m_i and r_i represent the concentrations of repressor R_i mRNA and protein,
respectively, and r_j the concentration of the repressor R_j regulating the expression of
repressor R_i. The parameter ε is proportional to the ratio of the mRNA and the
protein lifetimes.
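
A model of this type is straightforward to integrate numerically. The Python sketch below assumes the dimensionless form dm_i/dτ = ακ + κ/(1 + r_j^n) − m_i and dr_i/dτ = ε(m_i − r_i), with R3 repressing R1, R1 repressing R2, and R2 repressing R3. The parameter values (κ = 216, α = 0.001, n = 2, ε = 0.2), the fixed-step Runge-Kutta scheme, and all function names are choices made for this example and should be read as assumptions.

```python
def repressilator_rhs(state, kappa, alpha, n, eps):
    """state = [m1, m2, m3, r1, r2, r3]; gene i is repressed by protein i-1 (cyclically)."""
    m, r = state[:3], state[3:]
    dm = [alpha * kappa + kappa / (1.0 + r[i - 1] ** n) - m[i] for i in range(3)]
    dr = [eps * (m[i] - r[i]) for i in range(3)]
    return dm + dr


def rk4_step(state, dt, *params):
    """One classical fourth-order Runge-Kutta step."""
    k1 = repressilator_rhs(state, *params)
    k2 = repressilator_rhs([s + 0.5 * dt * k for s, k in zip(state, k1)], *params)
    k3 = repressilator_rhs([s + 0.5 * dt * k for s, k in zip(state, k2)], *params)
    k4 = repressilator_rhs([s + dt * k for s, k in zip(state, k3)], *params)
    return [s + dt * (a + 2.0 * b + 2.0 * c + d) / 6.0
            for s, a, b, c, d in zip(state, k1, k2, k3, k4)]


def simulate(kappa=216.0, alpha=0.001, n=2, eps=0.2, t_end=300.0, dt=0.05):
    """Return the time series of the R1 protein concentration."""
    # Asymmetric initial proteins: a perfectly symmetric start would never
    # leave the (unstable) symmetric steady-state manifold.
    state = [0.0, 0.0, 0.0, 1.0, 2.0, 3.0]
    trace = []
    for _ in range(int(t_end / dt)):
        state = rk4_step(state, dt, kappa, alpha, n, eps)
        trace.append(state[3])
    return trace


if __name__ == "__main__":
    r1 = simulate()
    late = r1[len(r1) // 2:]
    print("late-time range of r1:", round(min(late), 2), "to", round(max(late), 2))
```

A persistent spread between the late-time minimum and maximum of r1 indicates sustained oscillations rather than relaxation to a steady state.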

What makes sustained oscillations possible in the Repressilator network? The 
answer to this question can be obtained by considering the dynamics of the linear 
three-step repressor network in section 13.4.4. Recall that in the linear network, 
the transcription of the R1 repressor leads to increased expression from the P3
promoter after a time delay. In the Repressilator, the R1 repressor is expressed
from the P3 promoter. Hence, the network can be viewed as a negative feedback
system with time delay (transcription of R1 from the P3 promoter eventually causes
down-regulation of its own expression). The time delay in repression allows for 
the accumulation of protein product beyond the steady state level and, when 
the repression kicks in, for the subsequent decay in protein concentration. The 
mechanism causing oscillations in the Repressilator network is thus somewhat 
analogous to that leading to circadian clock oscillations as discussed in chapter 2 
and chapter 12. 

The conditions that make oscillations more likely to occur are high Hill coeffi- 
cients, low basal expression levels, and short protein lifetimes. This is illustrated 
by the bifurcation diagrams in figure 13.12B, which show the location of the
Hopf bifurcations in the κ, ε parameter space that separate regions of oscillatory and
steady state dynamics for different sets of parameter values. When the Hill
coefficients are low and basal expression high, oscillations occur at intermediate values
of κ when ε is greater than a critical value. In this case, the protein lifetime needs
to be comparable to that of mRNA in order for oscillations to occur. Decreasing
the basal expression rate and increasing the Hill coefficients relax this requirement
and allow oscillations to occur for a broader range of κ and ε values.

The Repressilator network is implemented experimentally by interconnecting the
LacI/PLlacO, the CI/PR, and the TetR/PLtetO modules such that LacI (R1) is
expressed from PR (P3), TetR (R2) is expressed from PLlacO (P1), and CI (R3) is
expressed from PLtetO (P2). To decrease the lifetime of the repressor proteins, the
repressor genes are “tagged” with a DNA sequence that targets the expressed protein
for degradation. Figure 13.12C illustrates the temporal oscillations in fluorescence
of a single cell measured when a fluorescent protein is expressed from a second
PLtetO promoter. At least 40% of cells oscillate with a period of 160±40 minutes.
This period is significantly longer than the average cell division time. Hence, an
oscillation initiated in a mother cell is completed in a daughter cell, and the
oscillation phase is passed down from one generation to the next. However, as shown
in figure 13.12D, siblings display marked differences in their progression through
the oscillation cycle, and the network fails to support coherent oscillations at the
population level.

Figure 13.12 The Repressilator. A. Architecture of the three-gene network. B.
Bifurcation diagram of the Repressilator model showing the regions of monostability and
oscillations in the κ, ε parameter plane for different values of α and n. C. Time series
of fluorescence emitted by a single cell harboring the Repressilator network. D.
Comparison of the single-cell time series in C with those generated by its siblings. Based on
experimental data from (Elowitz and Leibler, 2000).

The Repressilator network fails to support coherent oscillations partly because 
of the significant differences in the rate at which the regulatory signal propagates 
through the three regulatory steps. Recall from section 13.4.4 that cells harboring 
the three-step linear cascade show marked variability in the onset of expression from 
the P3 promoter. This is expected to translate directly into significant differences in
the oscillation period as the differences in response times in the linear cascade are 
equivalent to differences in delay times in the feedback network. Hence, the number 
of steps in the network makes it especially susceptible to stochastic effects. 

There are network architectures that generate oscillations with increased
robustness against stochastic effects (Vilar et al., 2002). One example is a design,
illustrated in figure 13.13A, where a transcriptional activator R1 enhances its own
expression from the multi-input promoter P12 and that of a repressor R2 from the
P1 promoter. The R2 repressor in turn attenuates the transcription of the activator
by binding to the P12 promoter.

Figure 13.13 Mixed positive and negative feedback oscillator. A. Architecture of the
network. B. Example of oscillations generated in simulations of the network model with
parameter values κ1 = 10, κ12 = 100, n = 3, a = 107, and ε = 0.1. C. Bifurcation
diagram showing the regions of monostability, bistability and oscillations in the κ12, κ1
plane for different values of ε for a = 107°? and n = 3. D. Coherent dampened oscillations
observed in cell populations carrying two variants of the network differing approximately
four-fold in the level of expression from the P12 promoter. Based on experimental data
from (Atkinson et al., 2003).

Using the same assumptions as in the model of the
Repressilator, the network can be modeled by the following dimensionless ODEs:








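\[
\begin{aligned}
\frac{dm_1}{d\tau} &= \alpha\kappa_{12} + \frac{\kappa_{12}}{1 + r_2^{\,n}} \cdot \frac{r_1^{\,n}}{1 + r_1^{\,n}} - m_1 \\
\frac{dm_2}{d\tau} &= \alpha\kappa_1 + \frac{\kappa_1 r_1^{\,n}}{1 + r_1^{\,n}} - m_2 \\
\frac{dr_i}{d\tau} &= \varepsilon\,(m_i - r_i), \qquad i = 1, 2
\end{aligned}
\tag{13.16}
\]
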

A simulation of the system for parameters yielding sustained oscillations is given
in figure 13.13B. The oscillations arise because of a time delay between the activation
and the repression of transcription from the P12 promoter. When the repressor concentration
is initially low, the positive feedback causes an increase in both activator and
repressor expression. However, because the negative feedback involves an additional
step, the expression of the repressor is delayed. As a result, and akin to the
pulse-generating network discussed in section 13.5.1, the activator accumulates to high
levels before the repressor can reach a concentration that is sufficiently high to
shut down the expression from the P12 promoter. Once this occurs, activator
expression ceases and the activator concentration declines. This in turn causes the
rate of repressor expression to decrease. The system subsequently returns to the low
repressor state and a new cycle is initiated. Figure 13.13C shows the bifurcation
diagram indicating the regions of monostability, bistability, and oscillations in the
κ1, κ12 parameter space. The bistability is associated with the positive feedback
and is more likely to occur in the absence of the negative feedback (that is, for
κ1 = 0).

The network illustrated in figure 13.13A is implemented experimentally by
augmenting the bistable NtrC single-gene positive feedback network discussed in
section 13.6.1 with a negative feedback (Atkinson et al., 2003). Recall that in this
network the strength of the positive NtrC feedback is dependent on LacI as the
modified NtrC-activated PglnA promoter (P12 in figure 13.13A) contains lacO
operators. Hence, a negative feedback is readily added to the network by expressing
LacI from a second NtrC-activated promoter (P1 in figure 13.13A). Figure 13.13D
shows two time series of the population-averaged expression level in cells harboring
two different variants of the oscillator network that have an approximately four-fold
difference in the expression rate from the modified PglnA promoter. In both cases,
the population-averaged expression level exhibits dampened oscillations with a
period of 10-12 hours. The dampening of the oscillations is not due to cells becoming
desynchronized. Measurements of the expression of a fluorescence reporter protein 
in individual cells (data not shown, see (Atkinson et al., 2003)) indicate that the 
dampening also occurs at the level of single cells. Given that the cells divide about 
once per hour, the coherence of the oscillations, which appears to be maintained for 
the duration of the experiments, that is, 40-50 generations, is quite remarkable. It 
confirms the prediction that a network architecture combining positive and negative 
feedback should be more robust against noise. 





13.7 Conclusions 


In this chapter, we have discussed selected synthetic gene regulatory systems im- 
plemented experimentally in E. coli to investigate the dynamics of transcriptional 
regulatory networks in vivo and to create strains with novel characteristics. These 
systems support a range of non-trivial behaviors such as cellular memory, pulse 
generation, spatial pattern formation, and oscillatory gene expression. In all the 
examples, the networks are designed with the aid of mathematical models based on 
fairly simple, phenomenological descriptions of relationships between input and out- 
put signals. These models are used to predict systems properties and how changes in 
DNA sequence affect performance and dynamics. In all cases, an excellent agreement 
between model predictions and experimental results is obtained. This demonstrates 
the close link between the current modeling methodologies and biological reality. 
Consequently, there is good reason to believe these methodologies are also useful 
for the analysis of the more complex regulatory systems found in nature. Indeed, 
the systems presented and analyzed elsewhere in this book strongly indicate that 
this is the case. 


Multilevel Modeling in Systems Biology:
From Cells to Whole Organs 


Denis Noble 


Successful physiological systems analysis requires that we understand the functional 
interactions between the key components of cells, organs, and physiological systems, 
and how these interactions change in disease states. This information resides neither 
in the genome nor even in the individual proteins that genes code for since no genes 
code for interactions as such. It lies at the level of protein network interactions 
within the context of sub-cellular, cellular, tissue, organ, and system structures. 
There is therefore no alternative to copying nature and computing these interactions 
to determine the logic of healthy and diseased states. The rapid growth in biological 
databases; models of cells, tissues, and organs; and the development of powerful 
computing hardware and algorithms have made it possible to explore functionality 
in a quantitative manner all the way from the level of genes to whole organs and 
systems. This chapter discusses the philosophy of multilevel modeling and illustrates 
this development in the case of the heart. Systems physiology of the 21st century is
set to become highly quantitative, and therefore one of the most computationally
intensive disciplines.





14.1 Introduction: The Philosophy of Multilevel Simulation 


The emphasis in recent decades of biological research has been on breaking cells, 
organs, and systems down into their smallest components: the genes, proteins, and 
other molecules whose interactions are essential to life. We have succeeded so well 
that the amount of molecular data generated by the new technologies has completely 
overwhelmed our ability to understand it. Genomics has provided us with a massive 
“parts catalog” for the human body, the 25,000 or so genes, while proteomics seeks 
to define these individual “parts” and the structures they form in detail. The 
parts catalog still needs a lot of annotating (gene ontology), and the proteomics
side is still in its infancy, being much more challenging than sequencing genomes
(see chapter 10). But, from the viewpoint of those interested in understanding cells, 
tissues, organs, and systems, there is as yet no “user’s guide” describing how these 
parts are put together to allow those interactions that sustain life or cause disease. 
The Human Physiome Project (Crampin et al., 2004; Hunter et al., 2002), the
project to model at multiple levels between cells and organs, aims to supply
precisely this.

We have a long way to go because, in many cases, the cellular, organ, and 
system functions of genes and proteins are unknown, though clues sometimes 
come from homology in the gene sequences and other patterns being investigated 
by bioinformatics. Moreover, even when we understand function at the protein 
level, successful intervention, for example in drug therapy, depends on knowing 
how a protein behaves in context, as it interacts with the rest of the relevant 
cellular machinery to generate function at a higher level. Without this integrative 
knowledge, we may not even know in which disease states a receptor, enzyme, 
or transporter is relevant, and we will certainly encounter side effects that are 
unpredictable from molecular information alone. This is a major problem for the 
drug industry. My field of cardiac simulation is central to this problem since nearly 
half the compounds developed by the industry interact with the heart, sometimes 
with fatal effect (Muzikant and Penland, 2000).

Inspecting genome databases alone will not get us very far in addressing these 
problems. The reason is simple. Genes code for protein sequences. They do not 
explicitly code for the interactions between proteins and other cell molecules and 
organelles that generate function. As the geneticist Gabriel Dover (2000) remarks 
“We don’t have a theory of interactions and until we do we cannot have a theory 
of development or a theory of evolution.” The challenge of developing a theory of 
interactions, which must be one of the major goals of systems biology, therefore also 
has implications for biology as a whole. We need to lead the way towards biology 
maturing as a science to join the physical sciences as a fully quantitative science, 
with fully-fledged theories within which computational biology can be embedded. 
Otherwise, what we do will be piecemeal, not integrated together. 

A major part of the difficulty is that much of the logic of the interactions in 
living systems is implicit. Wherever possible, nature leaves the interactions to the 
chemical properties of the molecules themselves and to the highly serendipitous 
way in which these properties have been exploited during evolution as nature has 
plundered its treasure chest of old genes to recruit new functions. It is as though 
the function of the genetic code is to build the components of a computer, which 
then self-assembles to run programs about which the genetic code knows nothing. 
The genetic code alone is not a program (Coen, 1999). Sydney Brenner (1998) 
expressed this very effectively when he wrote: “Genes can only specify the properties 
of the proteins they code for, and any integrative properties of the system must be 
‘computed’ by their interactions.” Brenner meant not only that biological systems 
themselves “compute” these interactions but also that in order to understand them
we need to compute them, and he concluded “this provides a framework for analysis
by simulation.” 

Brenner also coined the term that is being used to describe multilevel modeling, 
when he referred to it as “middle-out” (Novartis Foundation, 2001). An exhaustive 
“bottom-up” reconstruction is impossible (“I know one approach that will fail, which 
is to start with genes, make proteins from them and to try to build things bottom- 
up” (Novartis Foundation, 2005)). The approach that can work is to start modeling 
at any of the levels at which the data is sufficient to generate a model and then 
to reach out to lower and higher levels. In this way we can avoid the problems of 
information overload and combinatorial explosion (Feytmans et al., 2005). 

In this chapter I will show how far we have advanced in using simulation to 
understand these interactions between the levels of genes, proteins, cells, and organs. 
I will refer mostly to the case of the heart since this is the organ in which such 
simulation is currently most advanced. 





14.2 Cellular Models of the Heart 


Many of the characteristic functions of the heart reside in the properties of the 
cells. Cells generate electrical signals that initiate a cascade of events leading to 
muscular contraction. Some of them also generate repetitive activity and so act 
as pacemakers. They also contain receptors that respond to neural and hormonal 
control to speed up or slow down the rhythm and to increase or decrease the force 
of contraction. Finally some, but not all, arrhythmic mechanisms can be found 
at the cellular level. Not surprisingly, therefore, modeling work in heart systems 
physiology has nearly always started at the cellular level. 

The first cardiac cell models (Noble, 1960, 1962) sought insight into the most 
obvious difference between electrical activity in heart and nerve: the duration of 
the action potential. A nerve action potential may last only 1 msec. Its function is 
to encode information as rapidly as possible. A human ventricular action potential 
may last 400 msec, during which time many events are triggered that initiate and 
control mechanical contraction. 

Weidmann’s (1951) pioneering work showed that the conductance during the 
action potential is very low. The experimental reason for this became clear with the 
discovery of the inward-rectifier potassium channel current, I_K1 (Carmeliet, 1961;
Hall et al., 1963; Hutter and Noble, 1960) (see figure 14.1, top). The permeability
of this channel falls almost to zero during strong depolarization. These experiments
were also the first to show that there are at least two K+ channels in the heart,
I_K1 and I_K (referred to as I_x1 in early work, but now known to consist of I_Kr and
I_Ks (Noble and Tsien, 1969; Sanguinetti and Jurkiewicz, 1990)). The 1962 model
(Noble, 1960, 1962) (figure 14.1, bottom) was constructed to determine whether
this combination of K+ channels, together with a Hodgkin-Huxley type sodium
channel (a channel protein showing voltage-dependent activation and inactivation
processes) could explain all the classical Weidmann experiments on conductance
changes. The model not only succeeded in doing this, it also demonstrated that an
energy-conserving plateau mechanism was an automatic consequence of the
inward-rectifying properties of I_K1. This has featured in all subsequent models, and it is
a very important insight. The main advantage of a low conductance is minimizing 
energy expenditure. 
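
The link between inward rectification and a low-conductance plateau can be
illustrated with a minimal sketch. The Boltzmann-shaped open fraction, the
function names, and every parameter value below are hypothetical choices made
for illustration only; they are not the rate equations of the 1962 model.

```python
import math

# Hypothetical parameters (illustrative only, not taken from the 1962 model).
E_K = -90.0      # potassium reversal potential, mV
V_HALF = -80.0   # voltage at which half the channels conduct, mV
SLOPE = 10.0     # steepness of rectification, mV
G_MAX = 1.0      # maximal conductance, arbitrary units


def rectifier_conductance(v_mv):
    """Conductance of an idealized inward rectifier at membrane potential v_mv."""
    return G_MAX / (1.0 + math.exp((v_mv - V_HALF) / SLOPE))


def rectifier_current(v_mv):
    """Outward-positive K+ current: conductance times driving force."""
    return rectifier_conductance(v_mv) * (v_mv - E_K)


# Conductance collapses as the membrane depolarizes toward plateau potentials.
for v in (-90.0, -60.0, -30.0, 0.0):
    g = rectifier_conductance(v)
    i = rectifier_current(v)
    print(f"V = {v:6.1f} mV   g = {g:.4f}   I = {i:.4f}")
```

In this sketch very little potassium current flows at plateau potentials even
though the driving force on potassium is large there, which is the sense in
which a low plateau conductance limits ion flux and hence energy expenditure.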

Unfortunately, however, nature achieved a low conductance plateau at the cost 
of making the recovery (repolarization) process fragile. Pharmaceutical companies 
today are struggling to deal with evolution’s answer to this problem, which was to 
entrust repolarization to the potassium channel I_Kr. This channel protein, hERG
(Novartis Foundation, 2005), is one of the most promiscuous receptors known:
a wide range of drugs can enter the channel mouth to block it, and even more
interact with the G-protein coupled receptors that control it. The consequence can 
be failed repolarization, and the triggering of potentially fatal disorders of cardiac 
rhythm, called arrhythmias (see http://georgetowncert.org/qtdrugs_torsades.asp). 
Computer simulation is now playing a role in attempting to find a way around this 
difficult and seemingly intractable problem (Bottino et al., 2005; Fink et al., 2005; 
Muzikant and Penland, 2000).

The main defect of the 1962 model was that it included only one voltage-gated 
inward channel current, I_Na. There was a good reason for this. Calcium channels
had not then been discovered. There was, nevertheless, a clue in the model that 
something important was missing. The only way in which the model could be made 
to work was to greatly extend the voltage range of the sodium current by reducing 
the voltage dependence of the sodium activation process. In effect, the sodium 
current was made to serve the function of both the sodium and calcium channels 
so far as the plateau is concerned. There was a clear prediction here: either sodium 
channels in the heart are quantitatively different from those in nerves, or other 
inward current-carrying channels must exist. Both predictions are correct. 

The first successful measurements of ion channel activity under controlled membrane
potential conditions (using the technique known as the voltage clamp) came
in 1964 (Deck and Trautwein, 1964) and they rapidly led to the discovery of the 
cardiac calcium current (Reuter, 1967). By the end of the 1960s, therefore, it was 
already clear that the 1962 model needed replacing. However, the insights it gave on 
the behavior of the potassium currents are still valid. Systems biology can proceed 
in a stepwise fashion, in which different parts of an integrative analysis get clarified 
at different stages of the iteration between simulation and experiment. 

In addition to the discovery of the calcium current, the early voltage clamp 
experiments also revealed multiple components of I_K (Noble and Tsien, 1969), now
referred to as I_Kr and I_Ks (Sanguinetti and Jurkiewicz, 1990). We also showed that
these slow gated currents in the plateau range of potentials were quite distinct from 
those near the resting potential, that is, that there were two separate voltage ranges 
in which very slow conductance changes could be observed (Noble and Tsien, 1969). 
These experiments formed the basis of the McAllister, Noble, and Tsien (MNT) 
model (McAllister et al., 1975).

Figure 14.1 Top: experimental basis of the first analysis of the integrative role of
potassium channels. Redrawn (Noble, 2002b) from (Hall et al., 1963). The solid line shows
the total membrane current recorded in cardiac cells (from the conducting system of the
heart called Purkinje fibres) in a sodium-depleted solution. The inward-rectifying current
was identified as I_K1, which is extrapolated here as nearly zero at positive potentials. The
outward-rectifying current, I_K, is now known to be mostly formed by the component I_Kr.
The horizontal arrow indicates the trajectory at the beginning of the action potential,
while the vertical arrow indicates the time-dependent activation of I_K, which initiates
repolarization. Bottom: Sodium and potassium conductance changes computed from the
1962 model of the Purkinje fibre. Two cycles of activity are shown. The conductances are
plotted on a logarithmic scale to accommodate the large changes in sodium conductance.
Note the persistent level of sodium conductance during the plateau of the action potential,
which is about 2% of the peak conductance. Note also the rapid fall in potassium
conductance at the beginning of the action potential. This is attributable to the properties
of the inward rectifier channel I_K1.

The MNT model reconstructed a much wider range of experimental results, and
it did so with great accuracy in some cases. A good example of this was the
reconstruction of the paradoxical effect of small current pulses on the pacemaker
depolarization in Purkinje fibres: brief depolarizations (positive voltage
deflections) slow the process and brief hyperpolarizations (negative
voltage deflections) greatly accelerate it. This is paradoxical since the pacemaker
potential itself is a positive deflection so that one would expect positive deflections
to accelerate it. Reconstructing paradoxical or counterintuitive results is, of course,
a major function of modeling work. This is one of the roles of modeling in unraveling 
complexity in biological systems. 

But the MNT model also contained the seeds of a spectacular failure. Following 
the experimental evidence (Noble and Tsien, 1968) it attributed the slow conductance
changes near the resting potential to a slow gated potassium current, I_K2. In
fact, what became the “pacemaker current,” or I_f, is an inward current activated
by hyperpolarization (DiFrancesco, 1981), not an outward current activated by
depolarization. At the time it seemed hard to imagine a more serious failure than
getting both the current direction and the gating by voltage completely wrong. 
There cannot be much doubt therefore that this stage in the iterative interaction 
between experiment and simulation created a major problem of credibility. Perhaps 
cardiac electrophysiology was not really ready for modeling cellular systems to be 
successful? 

This is the point at which to emphasize one of the important points about the 
philosophy of simulation: it is one of the functions of models to be wrong. Of course, 
there are many ways of being wrong, and I am not talking here of failing in arbitrary 
or purely contingent ways, but in ways that advance our understanding by exploring 
the possible logics of complex systems and determining which are most accurate. 
Again, this situation is familiar to those working in simulation studies in engineering 
or cosmology or in many other physical sciences. And, in fact, the failure of the MNT 
model is one of the most instructive examples of experiment-simulation interaction 
in physiology, and of subsequent successful model development (see Noble (1984)). 

The MNT model was also the point of departure for the ground-breaking work 
of Beeler and Reuter (1977) who developed the first ventricular cell model.
(Ventricular cells are the real workhorse of the heart; they form the mass of muscle that
does all the pumping into the arterial blood system.) As they wrote of their model:
“in a sense, it forms a companion presentation to the recent publication of
McAllister et al. (1975) on a numerical reconstruction of the cardiac Purkinje fibre action
potential. There are sufficiently many and important differences between these two 
types of cardiac tissue, both functionally and experimentally, that a more or less 
complete picture of membrane ionic currents in the myocardium must include both 
simulations.” For a recent assessment of this model and the subsequent Luo-Rudy 
models see Noble and Rudy (2001).





14.3 Connecting to Ion Pumps and Calcium Cycling


New ground in modeling cardiac cells was broken with the DiFrancesco-Noble model 
(DiFrancesco and Noble, 1985). The incorporation not only of ion channels (following
the Hodgkin-Huxley paradigm of voltage-dependent gated channel proteins) but
also of ion exchangers, such as Na-K exchange (the sodium pump), Na-Ca exchange,
the sarcoplasmic reticulum (SR) calcium pump, and, more recently, the transporters
involved in controlling cellular pH (Ch’en et al., 1998), was a fundamental advance
since these are essential to the study of some disease states such as weak contraction (congestive heart
failure) (Winslow et al., 1999) and impaired blood supply (ischaemic heart disease). 

The greatly increased complexity of the DiFrancesco-Noble model, which for the 
first time also represented intracellular events by incorporating a model of calcium 
release from the sarcoplasmic reticulum, increased both the range of predictions 
and the opportunities for failure. Here I will limit myself to one example of each. 

The most influential prediction was that relating to the sodium-calcium exchanger.
In the early 1980s it was still widely thought that the electrically neutral
stoichiometry (Na:Ca = 2:1) derived from early flux measurements was correct. The
DiFrancesco-Noble model reached two important conclusions. The first was that,
with the experimentally known Na+ gradient, there simply was not enough energy
in a neutral exchanger to keep resting intracellular calcium levels below 1 µM, that
is, at a level low enough to permit relaxation to occur. Switching to a stoichiometry
of 3:1 readily allowed resting calcium to be maintained below 100 nM. This
automatically led to the prediction that there must be a current carried by the
Na-Ca exchanger and that, if this exchanger was activated by intracellular calcium, it
must also be strongly time-dependent since intracellular calcium varies by an order
of magnitude during each action potential. Even as the model was being published,
experiments demonstrating the current I_NaCa were being performed (Kimura et al.,
1986), and the variation of this current during activity was being revealed either 
as a late component of inward current or as a current tail on repolarization (Egan 
et al., 1989). 
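
The energetic argument about stoichiometry can be checked with a short
thermodynamic sketch. For an exchanger moving n Na+ ions in for each Ca2+ ion
out, equilibrium requires [Ca]_i = [Ca]_o ([Na]_i/[Na]_o)^n exp((n - 2)VF/RT).
This is only the equilibrium limit, not the DiFrancesco-Noble model itself, and
the concentrations, membrane potential, and temperature used below are assumed
round values chosen for illustration.

```python
import math

R = 8.314     # gas constant, J/(mol K)
F = 96485.0   # Faraday constant, C/mol


def equilibrium_ca_i(n_na, v_mv, na_i=8.0, na_o=140.0, ca_o=2.0, temp_k=310.0):
    """Lowest [Ca]_i (mM) an n_na Na+ : 1 Ca2+ exchanger can sustain.

    All default concentrations (mM) and the temperature are assumed,
    illustrative values, not figures quoted in the text.
    """
    vf_over_rt = (v_mv * 1e-3) * F / (R * temp_k)
    return ca_o * (na_i / na_o) ** n_na * math.exp((n_na - 2) * vf_over_rt)


# Neutral (2:1) versus electrogenic (3:1) exchange at a resting potential of -80 mV.
neutral = equilibrium_ca_i(2, -80.0)
electrogenic = equilibrium_ca_i(3, -80.0)

print(f"2:1 exchanger: {neutral * 1e3:.2f} uM")
print(f"3:1 exchanger: {electrogenic * 1e6:.1f} nM")
```

Under these assumptions the neutral exchanger cannot bring resting calcium
below the micromolar range, whereas the 3:1 exchanger, which also draws on the
membrane potential, can reach tens of nanomolar. Because one net charge moves
per cycle in the 3:1 case, the exchanger must carry a current.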

This prediction has turned out to have very important consequences for the 
elucidation of some of the mechanisms of cardiac arrhythmia in disease states in 
which cells accumulate sodium and calcium, either through loss of energy supply, 
as in ischaemia, or as a consequence of reduced activity of the Na-K ATPase (Na 
pump) as in treatment with cardiac glycosides. At a critical level of sodium and 
calcium accumulation, calcium release occurs spontaneously and becomes repetitive 
between sodium concentrations around 13 and 22 mM (Ch’en et al., 1998; Varghese
and Winslow, 1994; Winslow et al., 1999) (see figure 14.2), a phenomenon also seen
experimentally. Each calcium release activates inward current carried by
sodium-calcium exchange which, if large enough, can trigger additional (ectopic) action
potentials (Noble, 2002a) as shown in figure 14.6. 

Figure 14.2 Calcium oscillations computed in a model of the Purkinje fibre during the
rise in intracellular sodium [Na]_i following blockage of the sodium-potassium ATPase.
Each oscillation of intracellular calcium [Ca]_i triggers inward sodium-calcium exchange
current I_NaCa (Ch’en et al., 1998). These computations were done under voltage clamp
conditions. (Figure title: Simulation of ion and current changes during ischaemia;
trace label: voltage.)

The main defect of the DiFrancesco-Noble model was that the intracellular
calcium transient was far too large, mainly because the model did not represent
the attachment of calcium to intracellular proteins. This signaled the need to
incorporate intracellular calcium buffering.
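
Why buffering shrinks the free calcium transient can be seen from a minimal
sketch of a single, rapidly equilibrating binding site. The buffer
concentration, dissociation constant, and resting and peak calcium levels are
hypothetical values chosen for illustration; they are not parameters of any of
the models discussed here.

```python
def bound_calcium(free_um, b_total_um=100.0, k_d_um=0.5):
    """Calcium bound to a single-site buffer at equilibrium (all in uM).

    b_total_um and k_d_um are assumed, illustrative values.
    """
    return b_total_um * free_um / (k_d_um + free_um)


def total_calcium(free_um):
    """Free plus buffer-bound calcium (uM)."""
    return free_um + bound_calcium(free_um)


# Assumed resting and peak free calcium concentrations (uM).
rest_free, peak_free = 0.1, 1.0

delta_free = peak_free - rest_free
delta_total = total_calcium(peak_free) - total_calcium(rest_free)

print(f"rise in free calcium:  {delta_free:.2f} uM")
print(f"rise in total calcium: {delta_total:.2f} uM")
```

With these numbers most of the calcium entering the cytosol ends up bound, so a
model that omits buffering has to put the whole of the total change into the
free pool and therefore overestimates the transient.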

This deficiency was tackled in the Hilgemann-Noble (Hilgemann and Noble, 1987) 
modeling of the atrial action potential. Although this was directed towards atrial 
cells, it also provided a basis for modeling ventricular cells in species (rat, mouse) 
with short ventricular action potentials, and many of its features were adopted in 
later ventricular cell models of species with high plateaus (Luo and Rudy, 1994, 
1991; Noble et al., 1991, 1998). 

The Hilgemann-Noble model addressed a number of integrative systems questions 
concerning calcium balance: 


1. When does the calcium that enters during each action potential return to the 
extracellular space? Does it do this during the rest period between contractions 
(as most people had presumed) or during the contraction itself, that is, during, 
not after, the action potential? Hilgemann (1986) showed that the
recovery of extracellular calcium (in intercellular clefts) occurs remarkably quickly 
(see figure 14.3, inset). In fact, net calcium efflux is established as soon as 20 msec 
after the beginning of the action potential, which at that time was considered to be 
surprisingly soon. Calcium activation of efflux via the Na-Ca exchanger achieved 
this in the model (see figure 14.3; compare the computed trace [Ca]_o with the
experimental trace labeled [Ca]_o).

Figure 14.3 The first reconstruction of calcium balance in cardiac cells. The
Hilgemann-Noble model (Hilgemann and Noble, 1987) incorporated complete calcium cycling, such
that intracellular and extracellular calcium levels returned to their original state after each
cycle and that the effects of sudden changes in frequency could be reproduced. Left:
computed action potential (AP), intracellular calcium transient, contraction (represented by
cross-bridge formation), and extracellular calcium transient. Inset: experimental recording
of action potential (AP), cell motion, and extracellular calcium transient.


2. Where was the current that this would generate and did it correspond to the 
quantity of calcium that the exchanger needed to pump? Mitchell et al. (1984) had 
shown that replacement of sodium with lithium removes the late plateau. This was 
the first experimental evidence that the late plateau in action potentials with this 
shape might be maintained by sodium-calcium exchange current. The
Hilgemann-Noble model showed that this is precisely what one would expect.


3. Could a model of the SR that reproduces at least the major features of Fabiato’s 
experiments (Fabiato, 1983) showing calcium-induced calcium release (CICR) be 
incorporated into the cell models and integrate with whatever were the answers 
to questions 1-2? This was a major challenge. The model followed as much of 
the Fabiato data as possible, but the conclusions were that the modeling, while 
broadly consistent with the Fabiato work, could not be based on that alone. It is an 
important function of simulation to reveal when experimental data needs extending. 


4. Were the quantities of calcium, free and bound, at each stage of the cycle 
consistent with the properties of the cytosol buffers? The answer here was a 
very satisfactory yes. The great majority of the cytosol calcium is bound so that, 
although much more calcium movement was involved, the free calcium transients
were much smaller, within the experimental range.

There were, however, some gross inadequacies in the calcium dynamics. An
additional voltage-dependence of Ca release was inserted to obtain a fast calcium 
transient. This was a compromise that requires more detailed modeling of the 
spaces immediately between the cell membrane and the intracellular machinery—a 
space where calcium channels and the ryanodine receptors interact—a problem 
later tackled by Jafri, Rice and Winslow (1998) and by Noble et al. (1998). 
Another problem was how the conclusions would apply to action potentials with 
high plateaus. This was tackled both experimentally (LeGuennec and Noble, 1994) 
and computationally (Noble et al., 1998). The answer is that the high plateau in 
ventricular cells of guinea pig, dog, human, and so forth greatly delays the reversal 
of the sodium-calcium exchanger so that net calcium entry continues for a longer
fraction of the action potential. This property is important in determining the way 
in which the force of contraction varies with the frequency of the heart beat. 

Intracellular calcium dynamics have now become a major focus of simulation 
work (Coombes et al., 2004; Eisner et al., 2000; Hinch, 2004; Jafri et al., 1998; 
Puglisi et al., 2004; Soeller and Cannell, 2004). So also has the modeling of active 
transport and cardiac energetics (Matsuoka et al., 2004; Smith and Crampin, 2004), 
and the regulation by cell signaling networks (Saucerman and McCulloch, 2004). 
These advances are opening up the way for major developments in the use
of cardiac models in understanding disease states, where calcium dynamics, active 
transport, and cell signaling are often affected. 





14.4 Linking to the Genetic Level 


An important strength of models based on reconstructing the functional properties 
of proteins in cellular structural contexts is that it is possible for the models to reach 
down to the genetic level, for example by reconstructing the effects of particular 
mutations when these are characterized by changes in protein function (Noble, 
2002d). 

An example of this approach is the use of state-specific Markov models of the 
cardiac sodium channel (Clancy and Rudy, 1999) simulating the behavior of the 
wild-type and of a mutant sodium channel. The simulated mutation was the ΔKPQ
mutation, a three-amino-acid deletion that affects the channel inactivation and is 
associated with a congenital form of the long-QT syndrome, known as LQT3. The 
simulations showed that mutant channel reopenings from the inactivated state and 
channel bursting due to a transient failure of inactivation generate a persistent 
inward sodium current during the action potential plateau in the mutant cell. This 
causes major prolongation of repolarization and the development of arrhythmogenic 
early-after-depolarizations at slow pacing rates, a behavior that is consistent with 
the clinical presentation of bradycardia-related arrhythmogenic episodes during 
sleep or relaxation in LQT3 patients. 
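
How a failure of inactivation produces a persistent current can be sketched
with a deliberately reduced three-state scheme, closed (C), open (O), and
inactivated (I), held at a fixed depolarized potential. This is not the
Clancy and Rudy model, which uses many more states and voltage-dependent rates;
the scheme and all rate constants below are hypothetical.

```python
def steady_state(k_co, k_oc, k_oi, k_io):
    """Steady-state occupancies of a linear C <-> O <-> I scheme.

    A linear chain satisfies detailed balance, so
    P_O / P_C = k_co / k_oc and P_I / P_O = k_oi / k_io.
    """
    weight_c = 1.0
    weight_o = k_co / k_oc
    weight_i = weight_o * k_oi / k_io
    total = weight_c + weight_o + weight_i
    return {"C": weight_c / total, "O": weight_o / total, "I": weight_i / total}


# Hypothetical rates (per ms) at a plateau potential. The "mutant" differs only
# in a faster return from the inactivated to the open state (reopening).
wild_type = steady_state(k_co=10.0, k_oc=1.0, k_oi=5.0, k_io=0.001)
mutant = steady_state(k_co=10.0, k_oc=1.0, k_oi=5.0, k_io=0.05)

print(f"wild-type open probability: {wild_type['O']:.2e}")
print(f"mutant open probability:    {mutant['O']:.2e}")
```

In this toy scheme a modest increase in the reopening rate raises the
steady-state open probability many-fold, which corresponds to a small inward
current that persists throughout the plateau.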

Another sodium channel mutation that has been, at least partially, reconstructed 
is a missense mutation that affects the voltage dependence of sodium channel
inactivation and which is responsible for one form of idiopathic ventricular fibrillation.
In this case, small shifts of the voltage dependence of inactivation generate
early-after-depolarizations that may underlie fatal arrhythmia (Noble and Noble, 2000).
Simulation can also unravel the way in which the effects of these genetic mutations 
interact with drugs to explain why some people are particularly prone to arrhythmic 
side effects of many drugs (Noble, 2003b). 

Early after-depolarizations are also responsible for the arrhythmias of congestive 
heart failure. Winslow et al. (2001) have modeled this process based on
experimentally determined changes in gene expression levels for several of the transporter
proteins involved. 

These examples highlight the ability of cellular models to predict the arrhythmogenic
consequences of genetic and ion channel abnormalities either of behavior or of
expression levels. Given the present explosion of genetic information, such studies
will continue to be at the forefront of modeling efforts in the next decade.
Connecting the genome to physiology is one of the exciting prospects for computational
systems biology. 





14.5 Linking to Biochemistry: Counterintuitive Predictions 


Complex systems are characterized by the fact that the results of modeling them 
are frequently counterintuitive. Beyond a certain degree of complexity, armchair 
(qualitative) thinking is not only inadequate, it can even be misleading. A good 
example of this comes from the extension of cellular models to include some of the 
biochemical changes that occur during ischaemia (Ch’en et al., 1998). This work 
succeeds in reconstructing arrhythmias attributable to delayed after-depolarizations 
that arise as a consequence of intracellular calcium oscillations in conditions of 
sodium-calcium overload. These oscillations generate an inward current carried 
by the sodium-calcium exchanger which can lead to premature excitation. This 
work has led to some interesting counterintuitive predictions concerning up- and 
down-regulation of sodium-calcium exchange in disease states (Noble, 2002c). This 
transporter is currently a focus of anti-arrhythmia drug therapy. Simulation is 
playing an important role in clarifying and assessing the mechanism of action of 
such drugs. 

Another area in which modeling has been rich in counterintuitive results is 
that of mechano-electric feedback. Kohl and Sachs (2001) describe the extent to 
which this feedback mechanism has been unraveled in elegant experimental and 
computational work. Some of the results, particularly on the actions of changes in 
cell volume (which are important in many disease states) are unexpected and have 
been responsible for determining the next stage in experimental work. Indeed, it 
is hard to see how such unraveling of complex physiological processes can occur 
without the iterative interaction between experiment and simulation.





14.6 Linking to Pharmacology: Assessing and Predicting Drug Actions


Most drugs act on proteins such as receptors, channels, transporters, and enzymes. 
Models that reach down to the protein level are therefore relevant to assessing and 
predicting drug actions. Cardiac simulations have already been used in assessing 
drug action by the Food and Drug Administration in the United States, and we 
can expect this kind of use of biological models to increase as their complexity 
and power grows. An example of the detailed use of cardiac cell models in drug 
development can be found in Bottino et al. (2005) who used reverse engineering to 
determine the profile of action of drugs on ion channels from information on their 
effects on the action potential. I have reviewed some of these developments in more 
detail elsewhere (Noble and Colatsky, 2000; Noble et al., 1999). One obvious use 
in the case of the heart is in assessing the cardiac safety of drugs. Around half 
the drug withdrawals that have occurred in the United States post-launch since 
1998 have been attributable to cardiac side effects, often in the form of effects 
on the electrocardiogram and consequent arrhythmias. This is a large and very 
expensive form of attrition. Since virtually all the ion transporters involved in 
cardiac repolarization are now modeled and very realistic simulations of the T wave 
of the electrocardiogram can be obtained, when these models are incorporated into 
3-dimensional cardiac tissue models it is possible to use in silico screens for drug 
development. One of the reasons that this is necessary is that the electrocardiogram 
is, unfortunately, an unreliable indicator of potential arrhythmogenicity. Similar 
changes in form of the electrocardiogram can be induced by very different molecular 
and cellular effects, some benign, others dangerous. We need to understand and 
predict the mechanisms all the way from individual channel properties through 
to the electrocardiogram. This goal is within reach, particularly as we acquire 
more experience of the incorporation of accurate cellular models into anatomically 
detailed organ models (see below). 

Another use of simulation in drug discovery will be in screening drugs for multiple 
actions. Very few drugs that act on the heart bind to just one receptor. It is much
more common for two, three, or even more receptors or channels to be affected. This is
particularly true for drugs that act on the sodium-calcium exchanger (Watanabe 
and Kimura, 2000). An important point to realize here is that multisite action may 
actually be beneficial. The reviews referred to above give examples of multireceptor 
drug actions that would be expected to be beneficial. I predict that this will in fact 
be one of the ways in which more rational discovery of anti-arrhythmic drugs may 
occur. In regulating cardiac function, nature has developed many multiple-action 
processes, particularly those regulated by G-protein coupled receptors. In seeking
more “natural” ways of intervening in disease states, we should also be seeking to
play the orchestra of proteins in more subtle ways. We need simulation to guide us 
through the complexity and to understand multiple action functionality. Examples 
of this approach to combinatorial drug action in computational biology of the heart 
now exist (Noble, 2003b; Noble and Colatsky, 2000). 
14.7 Linking to Tissues and Organs


In the case of the heart, in addition to the data-rich cellular level, there is also 
data-rich modeling of the 3-dimensional geometry of the whole organ (Costa et al., 
1996; LeGrice et al., 2001). Connecting this level to that of cell modeling has been 
an exciting venture (Crampin et al., 2004; Kohl et al., 2000; Smith et al., 2001). 
Anatomically detailed models of the ventricles, including fiber orientations and 
sheet structure, have been used to incorporate the cellular models in an attempt to 
reconstruct the electrical and mechanical behavior of the whole organ. 





Figure 14.4 Spread of the electrical activation wavefront in an anatomically detailed 
cardiac model. Earliest activation occurs at the left ventricular endocardial surface near 
the apex (left). Activation then spreads in endocardial-to-epicardial direction (outwards) 
and from the apex towards the base of the heart (upwards, middle frames). The activation 
sequence is strongly influenced by the fibrous-sheet architecture of the myocardium, as 
illustrated by the non-uniform transmission of excitation. Black = activation wavefront; 
white = endocardial surface. 


Figure 14.4 shows stills from a simulation in which the spread of the activation 
wavefront is reconstructed. This is heavily influenced by cardiac ultra-structure, 
with preferential conduction along the fiber-sheet axes, and the result corresponds 
well with that obtained from multi-electrode recording from dog hearts in situ. 
Accurate reconstruction of the depolarization wavefront promises to provide
reconstruction of the largest phases of the electrocardiogram. Other parts of the organ,
including the pacemaker region (sinus node), the atrium (the chambers receiving
venous blood), and the specialized conducting system are now being incorporated
into the model heart so that we can look forward to the first example of
reconstruction of a complete physiological process from the level of protein function right up
to routine clinical observation. Work is in progress in a number of laboratories on
simulation of the sinus node (Boyett et al., 2003, 1999; Dobrzynski et al., 2003; 
Garny et al., 2003) and atrium (Blanc et al., 2001; Garny et al., 2000; Harrild and 
Henriquez, 2000). The whole ventricular model has already been incorporated into 
a virtual torso (Bradley et al., 1997), including the electrical conducting properties 
of the different tissues, to extend the external field computations to reconstruction 
of multiple-lead chest and limb recording. 
14.8 Coronary Circulation


Ischaemic heart disease is a major cause of serious incapacity and mortality. It 
is also a good example of the fact that most disease states are multifactorial. 
Very few diseases are attributable to single gene or protein malfunction. As noted 
above, cellular reconstructions of the metabolic and electrophysiological processes 
that occur following deprivation of the energy supply to cardiac cells have already 
advanced to the point at which some arrhythmic mechanisms can be reproduced. 
The initiating process in such energy deprivation is restriction or blockage of 
coronary arteries. This is another example where modeling at different data-rich 
levels is holding out the prospect of very exciting integration of function. Figure 14.5 
shows some of the spectacular modeling of the coronary circulation (Smith et al., 
2000, 2001). These are stills from a simulation in which the blood flow through 
an anatomically-detailed model of the coronary circulation is computed while the 
ventricles are beating. The simulation therefore also included the deformation that 
occurs as mechanical events influence blood flow. 





Figure 14.5 Flow calculations coupled to the deforming myocardium. The color coding 
represents transmural pressure acting on the coronary vessels from the myocardial stress 
(dark gray = zero pressure, light gray = peak pressure). The deformation states are (from 
left to right) zero pressure, end-diastole, early systole, and late systole. 


This model has already been used to investigate the changes in blood flow that 
occur following constriction or block of one of the main arterial branches, and work 
is in progress to connect this to the modeling of ischaemia at the cell and tissue 
level (see figure 14.6). If we can also connect the cellular mechanisms of arrhythmia 
to the processes by which regular excitation breaks down into the multiple wavelets 
of ventricular fibrillation (Panfilov and Kerkhof, 2004) then yet another “grand 
challenge” for integrative physiological computation will come within range: the 
full-scale reconstruction of a coronary heart attack. 

This is a suitable point at which to note that I chose the term grand challenge 
deliberately. This kind of work requires massive computer power. The whole organ
simulations described here require many hours of computation using
supercomputers. (By contrast, the single cell models can be run faster on a PC or laptop
than in real time.) Future progress will be determined partly by the availability of
computing capacity. 
Figure 14.6 Left: the coronary circulation model shown in figure 14.5 has been subjected
to a constriction of one of the main branches leading to blocked blood flow in the regions
colored black. (Figure kindly provided by Nic Smith.) Right: simulation of ectopic beats
using the DiFrancesco-Noble 1985 Purkinje fiber model (Noble, 2002a) in conditions of
calcium overload of the kind that occurs in ischaemic tissue. To simulate sodium/calcium
overload, $[\mathrm{Na}]_i$ was increased from 8 to 12 mM (see figure 14.2). Oscillatory calcium
changes (bottom) induce inward sodium-calcium exchange current (middle) leading to
initiation of action potentials (above). The first action potential is evoked by a current
pulse. The second two are initiated by calcium oscillations. Note that the rise in $[\mathrm{Ca}]_i$ and
the flow of inward Na-Ca exchange current occur before the depolarization. Linking these
two levels of modeling to create a complete model of coronary heart attack is one of the
“grand challenges” requiring massive computer power.


Blood flow within the chambers of the heart, including the movement of valves, 
has been elegantly modeled by Peskin and McQueen (1993) and this has been 
extended to the study of diastolic mechanical function (Kovacs et al., 2001). 





14.9 The Future: From Genome to Proteome to Physiome 


Integrative multilevel modeling of biological systems is an important technique for 
organizing and integrating vast amounts of biological information. Although this 
article has focused on modeling of the heart, it is important to note that multilevel 
biological simulation is now being done for a wide range of pathways, cells, and 
systems. The role of in silico biology in medical and pharmaceutical research is 
likely to become increasingly prominent as we seek to exploit the data generated 
through rapid gene sequencing and proteomic mapping through to creating the 
physiome. 

However, progress will be significantly enhanced by enabling ever greater numbers 
of researchers to use and verify models in the course of their everyday experimental 
work. It has been extremely difficult to transfer models between research centers 
or to extend existing models so that more complex models can be constructed 
in an object-oriented or modular fashion. This process will be enhanced by the
development of uniform standards for representing and communicating the content
of models and by the wide distribution of software tools that permit even
non-modelers to access, execute, and improve existing models. Increasingly, publication
of models is accompanied by their availability on Web sites. And the process of 
establishing standards of communication and languages is developing (Lloyd et al., 
2004). 

Once this is achieved, we can confidently predict an explosion in the development 
of integrated model cells, organs, and systems. In a few years we shall all wonder 
how we ever managed to do without them in biological research. So far as drug 
development is concerned, there will certainly be a major change as these tools 
come on line and rapidly increase in their power. This will grow in a nonlinear way 
with the degree of biological detail that is incorporated. The number of interactions 
modeled increases much faster than the number of components. Biology is set to 
become highly quantitative in the 21st century. It will become a computer-intensive
discipline. 
Computational power has followed Moore’s Law quite well, and modeling in biology
has taken full advantage of this. Nevertheless, as models get bigger and more aspects 
of the models need to be inferred, many techniques that are applicable for small 
scale models become inapplicable. A review of basic aspects of algorithms and 
data structures is provided, along with a summary of the computational aspects of 
sorting, searching, dynamic programming, graph theory, dynamical systems, and 
noise reduction algorithms. 





15.1 Introduction 


How many interacting species of molecules can one realistically mathematically 
model? Quite apart from the question of comprehending the interactions of several 
hundred thousand molecular species, there are limitations on systems biology 
that arise from simple combinatorics coupled with the lack of precise biological 
knowledge of the properties in vivo of most molecules. 

The only way to overcome such limitations is to incorporate as much biological 
knowledge as possible. The task of modeling without using biological knowledge 
is, frankly, computationally impossible. As a simple example, consider a set of ten 
independent hypothetical interactions. In a eukaryotic system, a single protein may 
well have about ten splice sites or interactions. Our task is to ascertain which 
of these hypotheses is present or absent, given some experimental data. We are 
immediately faced with the task of generating $2^{10}$ = 1,024 independent models,
to fit each individual model with the data, and then determine the correct model 
by looking at some goodness-of-fit criterion. If this is the case for a single protein, 
consider the situation for modeling the complete proteome. This is an example of 
the combinatorial explosion of modeling in the face of partial knowledge and limited 
observability. Of course, there are sampling strategies that one can attempt to use
to lessen the burden in this particular example, but in general, in biology, one must 
use as much biological information as possible to reduce the number of models that 
need to be evaluated. 
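
The size of this model space can be made concrete with a minimal sketch. The interaction labels below are hypothetical placeholders, and the enumeration is only meant to illustrate how quickly the count doubles with each additional unknown; it is not a proposal for how fitting should actually be organized.

```python
from itertools import product

# Hypothetical labels for ten independent candidate interactions.
hypotheses = [f"interaction_{i}" for i in range(10)]

def enumerate_models(names):
    """Yield every present/absent assignment as a dict mapping name -> bool."""
    for flags in product([False, True], repeat=len(names)):
        yield dict(zip(names, flags))

models = list(enumerate_models(hypotheses))
print(len(models))  # number of candidate models, 2**len(hypotheses)
```

Each additional hypothesis doubles the length of this list, which is why prior biological knowledge that fixes even a few of the flags in advance is so valuable.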

Many properties, such as adaptability and robustness (chapter 2), are apparent 
only in dynamical simulations with limited changes in inputs and/or environment 
leading to no qualitative change in behavior in the case of adaptability, or limited 
changes in reaction rates or deletions of a few species producing no qualitative 
change in the case of robustness. Even if we know in advance which qualitative 
feature of the dynamical behaviour we wish to preserve (and this is rarely the case), 
there is no simple way to infer which changes will or will not preserve the qualitative 
feature from the static properties of a mathematical model, unless the model 
is specifically constructed in accord with standard techniques in control theory 
(chapter 12) and robust design. Is a model so designed mimicking biology at the 
biochemical level? Not necessarily. This underscores the fact that a mathematical 
model is, first and foremost, a model. It is useful as an attempted abstraction 
and simplification (chapter 3) of the essential features of the phenomenon being 
modeled and as a hypothesis-generating tool for further experiments, but it is 
not computationally feasible to make models that are in silico exact replicas of 
biochemistry in vivo for anything beyond a trivial scale. 

Much has been made of the availability of large-scale data sets in systems biology. 
While it is certainly true that these cell- or organ-level complete coverage data sets 
allow screening of most of the relevant factors in any given biological phenomenon, 
none of the results of such screening can be translated directly into mathematical 
models, for several reasons: 


1. Cellular localization information is absent (chapter 11). 


2. Due to resource and experimental limitations, the time scales of measurement 
are rarely fine enough to allow observation of both very quick transient initial or 
priming responses and enough detail of following longer term behavior (chapter 6). 


3. Determinations of the interactions of proteins oftentimes have large false-positive 
and false-negative rates (chapter 10). 


4. The models are too large and have too much missing information (reaction rates, 
localization, concentrations) to be computationally tractable, even on massively 
parallel supercomputers. 


While new and improving technologies will alleviate these problems in the fullness of 
time, predictive modeling at the present juncture requires cognizance of this reality. 
This chapter is concerned with an overview of various computational algorithms 
that are relevant for systems biology, with the specific aim of fleshing out (4) above. 
15.2 Algorithms


Algorithms for solving computational problems, either discrete (such as finding 
cliques in graphs) or continuous (parameter optimization for a dynamical model), 
are judged on correctness, efficiency, and ease of implementation. A heuristic, on 
the other hand, is probably intuitively clear in the way it approaches a problem, 
but does not come with any guarantee of correctness (Rawlins, 1991; Knuth, 1997). 

There are two main ways to compare algorithms. One corresponds to the
random-access-machine (RAM) model of computation, and the other is an asymptotic
analysis of worst case complexity. The RAM model counts computational and
memory access steps, but is somewhat simplistic in that it assumes that all basic 
and arithmetic operations take equal amounts of time. This is not true for real 
processors, of course, but anything more detailed would be processor-dependent, 
and therefore not of as much utility in comparing algorithms. The RAM model 
tells us how an algorithm might work on a given input, but it does not give an 
indication of how the algorithm might fare on a typical input, or the worst or best 
case scenarios for different inputs. These are important issues, especially if we do 
not know too well the kinds of data we might obtain in biological experiments. 

The growth rate of an algorithm gives us an idea of how big an input we can 
realistically compute with it. The growth rate is computed by counting all the steps 
needed to carry out the algorithm, with each computation or memory access step 
counting as one step, for a given input size n, which might, for example, be the 
number of protein nodes in a graph of yeast-2-hybrid predictions of protein-protein 
interactions. 


1. An algorithm whose growth rate is n! becomes useless before n = 20. 
2. An algorithm whose growth rate is $2^n$ becomes useless before n = 40.


3. An algorithm whose growth rate is $n^2$ will work reasonably up to n = 100,
but will rapidly become impractical beyond this. For $n = 10^6$, such an algorithm
requires $10^{12}$ steps.


4. An algorithm whose growth rate is n will likely be useful up to $n = 10^9$, and
this holds even for algorithms with logarithmic corrections such as n log n.


5. An algorithm whose growth rate is log n will be useful forever. There are such
algorithms, for example, binary search. 


There are usually constant multiplicative factors associated with these growth rates, 
and they make some difference to assessments of the viability of the algorithms for 
small values of n, but for asymptotically large values of n, these constants are 
usually irrelevant. Algorithms of types 1 and 2 are not relevant for the scale of 
problems usually of interest in systems biology. For problems of type 3, bigger and 
faster computers will make a difference. 
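
As a rough illustration of these thresholds, the following sketch finds the largest input size whose step count fits within a fixed budget. The budget of $10^{12}$ steps and the helper name `largest_feasible_n` are assumptions chosen for illustration; constant factors are ignored, as in the discussion above.

```python
import math

def largest_feasible_n(steps, budget):
    """Largest n with steps(n) <= budget, for a non-decreasing steps function.

    Uses doubling to bracket the answer, then bisection. Assumes steps(1) <= budget.
    """
    lo, hi = 1, 2
    while steps(hi) <= budget:
        lo, hi = hi, hi * 2
    while hi - lo > 1:
        mid = (lo + hi) // 2
        if steps(mid) <= budget:
            lo = mid
        else:
            hi = mid
    return lo

BUDGET = 10 ** 12  # assumed number of basic steps we are willing to wait for

growth_rates = {
    "n!": math.factorial,
    "2^n": lambda n: 2 ** n,
    "n^2": lambda n: n * n,
    "n": lambda n: n,
}

feasible = {name: largest_feasible_n(f, BUDGET) for name, f in growth_rates.items()}
for name, n in feasible.items():
    print(f"{name:>4}: largest feasible n = {n}")
```

Under this assumed budget the factorial and exponential growth rates stall at a few tens of items, while the quadratic and linear rates reach a million and a trillion items respectively. Multiplying the budget by a thousand adds only a handful of items to the first two.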

To understand which algorithm to use, we must describe the modeling problem 
in a manner such that we are able to look in repositories of algorithms and find 
appropriate results. The common terms used in formal descriptions of algorithms are
abstract structures such as graphs, permutations, trees, and sets. Permutations are 
relevant if the problem requires different arrangements, orderings, or sequences. An 
example in systems biology is system identification, when we do not know the order 
in which molecules appear in a pathway, or in what temporal order the molecules 
interact with each other. Sets are relevant when the problem seeks a cluster, a 
collection, or any selection from a set of items. Trees appear in problems with 
hierarchical relationships. Submodules embedded in modules (chapter 3) might be 
one example in hierarchical dynamical models. Graphs appear in problems involving 
networks, circuits, or webs. Points representing locations in some space appear 
in problems like protein conformations or protein localization in a cell. Strings 
appear in problems involving patterns or labels. Finding consensus sequences for 
transcription factor binding sites upstream of a set of genes is a prototypical string 
matching problem. 

Data structures are the flip side of the computational cost coin. The organization 
of the data impacts algorithm performance greatly. A basic understanding of the
common data types used in computer science is helpful in deciding how to
store experimental data. While biologists typically will not need the detailed
implementations of the data structures introduced below, an acquaintance with them may
facilitate communications with collaborators in other sciences.

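A container is a data type which permits storage and retrieval of data irrespective
of the content of the data. Such a data type has access only through insertion or
retrieval operations, and might be implemented as a stack (permitting
last-in-first-out (LIFO) retrieval)

\[
\rightleftarrows\ [\mathrm{Item}_n][\mathrm{Item}_{n-1}][\mathrm{Item}_{n-2}] \cdots [\mathrm{Item}_1][\mathrm{Item}_0] \tag{15.1}
\]

as a queue (permitting first-in-first-out (FIFO) retrieval

\[
\rightarrow\ [\mathrm{Item}_n][\mathrm{Item}_{n-1}][\mathrm{Item}_{n-2}] \cdots [\mathrm{Item}_1][\mathrm{Item}_0]\ \rightarrow \tag{15.2}
\]

useful for algorithms where the order of the stored data is important) or as a table
(permitting retrieval indexed by position, as in an array)

\[
\begin{array}{ccccc}
0 & 1 & 2 & \cdots & n \\
\downarrow & \downarrow & \downarrow & & \downarrow \\
{[\mathrm{Item}_0]} & {[\mathrm{Item}_1]} & {[\mathrm{Item}_2]} & \cdots & {[\mathrm{Item}_n]}
\end{array} \tag{15.3}
\]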
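
The three container behaviors can be sketched with Python's built-in list and the standard-library `collections.deque`. The item names are placeholders; this is one convenient way to obtain the behaviors, not the only one.

```python
from collections import deque

items = ["item0", "item1", "item2"]

# Stack: the last item inserted is the first one retrieved (LIFO).
stack = []
for item in items:
    stack.append(item)
lifo_order = [stack.pop() for _ in range(len(items))]

# Queue: the first item inserted is the first one retrieved (FIFO).
queue = deque()
for item in items:
    queue.append(item)
fifo_order = [queue.popleft() for _ in range(len(items))]

# Table: retrieval by position, as in an array.
table = list(items)
second = table[1]

print(lifo_order, fifo_order, second)
```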
A dictionary is a data type suited for accessing data by content. Thus a dictionary
has keys and data referred to by the keys. Dictionaries permit search for data given
a key, insertion, and deletion.

\[
\begin{array}{ccc}
\mathrm{Key}_0 & \rightarrow & \mathrm{Value}_0 \\
\mathrm{Key}_1 & \rightarrow & \mathrm{Value}_1 \\
\mathrm{Key}_2 & \rightarrow & \mathrm{Value}_2 \\
\vdots & & \vdots \\
\mathrm{Key}_n & \rightarrow & \mathrm{Value}_n
\end{array} \tag{15.4}
\]


Structures that can implement the dictionary data type include linked lists
(constant time insertion and deletion but more computational steps for search), arrays
(constant time search but longer insertion and deletion times), binary search trees, 
or hash tables. Binary search trees store data in a tree structure with each node 
having 0, 1, or 2 offspring. If the keys in the dictionary have an ordering (in other 
words, for any two keys we can decide if x < y or y < x; an example of an ordering 
is alphabetical ordering), when we search for the data labeled by a key x, we search 
from the root node up the tree, taking the left branch if the key at the root node is 
larger than x and the right branch otherwise, and proceeding onwards recursively. 


[Diagram (15.5): a binary search tree drawn with the root at the bottom, two
interior keys above it, and four leaf keys at the top.]


If the items were inserted more or less at random, the search will involve about
log n steps. As an example of the difference between worst-case performance and
average performance, notice that if the items were inserted in an ordered fashion,
the search will take n steps instead.
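
The contrast between random and ordered insertion can be checked with a small unbalanced binary search tree. This is a minimal sketch: the class and function names are illustrative, the keys are plain integers, and no rebalancing is attempted.

```python
import random

class Node:
    def __init__(self, key, value):
        self.key, self.value = key, value
        self.left = self.right = None

def insert(root, key, value):
    """Insert a key iteratively and return the (possibly new) root."""
    new = Node(key, value)
    if root is None:
        return new
    node = root
    while True:
        if key < node.key:          # stored key is larger: go left
            if node.left is None:
                node.left = new
                return root
            node = node.left
        else:                       # otherwise go right
            if node.right is None:
                node.right = new
                return root
            node = node.right

def search_steps(root, key):
    """Number of nodes visited while looking for key (None if absent)."""
    steps, node = 0, root
    while node is not None:
        steps += 1
        if key == node.key:
            return steps
        node = node.left if key < node.key else node.right
    return None

def build(keys):
    root = None
    for k in keys:
        root = insert(root, k, f"value_{k}")
    return root

n = 1000
shuffled = list(range(n))
random.Random(0).shuffle(shuffled)   # fixed seed so the run is repeatable

worst_ordered = search_steps(build(range(n)), n - 1)
worst_random = max(search_steps(build(shuffled), k) for k in (0, n // 2, n - 1))
print(worst_ordered, worst_random)
```

With ordered insertion the tree degenerates into a chain, so the last key is found only after visiting all n nodes. With shuffled insertion the same searches visit a few tens of nodes at most.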

While containers and dictionaries are the most common data types, more
specialized data types are valuable when the data to be stored has more structure.
Examples of such data structures are suffix trees to store strings, kd-trees to store
geometric objects, adjacency lists (lists of pairs of connected vertices) or adjacency
matrices (matrices indexed by the vertices in the graph, with non-zero entries
corresponding to vertices linked by an edge) to store sparse or densely connected graphs
respectively, and set data stored as hypergraphs or in the form of dictionaries
associated with subsets.
15.3 Sorting


Sorting is a basic building block of algorithm design and takes about n log n steps
for sorting a set of n elements. Thus, sorting is the type of building block in an 
algorithm that can be used for fairly large sets of items. One should use sorting as 
much as necessary without worrying about rendering the problem computationally 
intractable. Sorting is the basis for 


A. Searching: After sorting the keys, one can test for the presence of an item in a
dictionary in log n time.

B. Closest pair: To find the closest pair of numbers, sort the numbers and then do
a linear time scan through the sorted list. Total time required (including sorting):
n log n.

C. Selection: What is the kth largest item in a set? Sort the set and look at the kth
position.
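
The three uses of sorting listed above can be sketched in a few lines. The function names are illustrative, and Python's built-in `sorted` stands in for whichever n log n sort is available.

```python
def contains(sorted_keys, x):
    """Binary search in a sorted list: about log n probes."""
    lo, hi = 0, len(sorted_keys)
    while lo < hi:
        mid = (lo + hi) // 2
        if sorted_keys[mid] < x:
            lo = mid + 1
        else:
            hi = mid
    return lo < len(sorted_keys) and sorted_keys[lo] == x

def closest_pair(numbers):
    """Sort, then scan neighbors once; needs at least two numbers."""
    s = sorted(numbers)
    return min(zip(s, s[1:]), key=lambda pair: pair[1] - pair[0])

def kth_largest(numbers, k):
    """k = 1 returns the largest item."""
    return sorted(numbers, reverse=True)[k - 1]

data = [17, 3, 44, 8, 29, 10]
print(contains(sorted(data), 29), closest_pair(data), kth_largest(data, 2))
```

After sorting, the closest pair must be adjacent in the sorted order, which is why a single linear scan suffices.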


Two general principles of algorithms are divide and conquer and randomization. 
Divide and conquer is the principle of dividing the original problem into several 
smaller ones, solving the smaller problems, then combining the solutions of the 
smaller problems into a solution of the original problem. This is typically possible 
in a recursive fashion. Randomization is the principle of randomizing the input 
data in order to ensure with high probability that a given algorithm’s good average 
behavior is utilized, as opposed to (possibly) much worse worst-case behavior. 

A sorting algorithm called mergesort is a good example of divide and conquer. 
The data to be sorted is split into two subpiles, each of which is then sorted. 


                    MERGESORT
                   /         \
              MERGE           SORT
                |               |
              EEGMR                       ORST
              GMR          EE             ORST          (15.6)
              MR           EEG            ORST
              R            EEGM           ORST
              R            EEGMO          RST
                           EEGMORR        ST

                           EEGMORRST


We then merge the two sorted subpiles by comparing the first (lowest) elements in 
each sorted subpile. The lowest of the two elements is removed, leaving the next 
lowest element as the lowest element in one of the two subpiles, and so on. This 
merging process takes on the order of n steps, and the recursion into smaller subpiles
is on the order of log n levels deep, so the total running time of this algorithm is n log n.
Quicksort is another sorting algorithm, in which we pick an arbitrary element x 
of the data set. The rest of the data is separated into the elements larger than x 
and the elements smaller than x, and each of these subsets is sorted. The complete 
sorted set is then obtained by merging the results with x inserted in between. 


                    QUICKSORT
                   /    |    \
               ICKO     Q     USRT                      (15.7)
               CIKO     Q     RSTU

                    CIKOQRSTU


The total cost is on the order of n log n steps on average since x is more likely to
be closer to the center of the sorted set than to the edges. However, if x happens to
be at either end, the number of steps will be more like $n^2$. To ensure the average
good behavior is obtained with high probability, we use randomization along with 
quicksort. 
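
Both sorting schemes can be written compactly. The sketch below is illustrative rather than tuned: it copies sublists freely for clarity, and the quicksort randomizes the choice of the pivot x, which is one common way to apply randomization (shuffling the input beforehand is another).

```python
import random

def merge(left, right):
    """Merge two sorted lists by repeatedly taking the smaller front element."""
    out, i, j = [], 0, 0
    while i < len(left) and j < len(right):
        if left[i] <= right[j]:
            out.append(left[i])
            i += 1
        else:
            out.append(right[j])
            j += 1
    out.extend(left[i:])
    out.extend(right[j:])
    return out

def mergesort(data):
    if len(data) <= 1:
        return list(data)
    mid = (len(data) + 1) // 2      # first subpile gets the extra element
    return merge(mergesort(data[:mid]), mergesort(data[mid:]))

def quicksort(data):
    if len(data) <= 1:
        return list(data)
    pivot = random.choice(data)     # random pivot guards against bad inputs
    smaller = [x for x in data if x < pivot]
    equal = [x for x in data if x == pivot]
    larger = [x for x in data if x > pivot]
    return quicksort(smaller) + equal + quicksort(larger)

print("".join(mergesort("MERGESORT")))
print("".join(quicksort("QUICKSORT")))
```

With the split chosen here, the first call to `mergesort` divides MERGESORT into MERGE and SORT, matching the worked example in the text.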

If we know more about the distribution of the data, we can apply more specialized 
algorithms like distribution sort. The key point to note here is that the specialized 
algorithms may perform much worse if our hypothesis about the distribution 
happens to be incorrect. 





15.4 Dynamic Programming 


Dynamic programming algorithms (Denardo, 2003) are algorithms which solve problems by
solving and storing the solutions to small problems, and then combining these
solutions into a solution of the larger problem. There is, of course, a trade-off:
memory is traded for speed. The memory requirements must be kept in mind
for such algorithms. An important feature of dynamic programming algorithms
is optimal sub-structure: the solutions to the sub-problems that make up the solution
to the problem posed are themselves optimal solutions. In other words, all future steps depend
only on which state the algorithm is in, not on how the algorithm got to that state.
Dynamic programming is, in a sense, the opposite of recursion.

Thus, the entire focus in looking for a dynamic programming solution is to 
establish what are the appropriate steps in the solution, what are the decisions at 
any step, and what are the states that are associated with each step. The decision 
at any step must determine the next state, given the state you are in. Dynamic 
programming is particularly useful in cases where there is a natural ordering to each 
input such as the left-to-right ordering of bases in a DNA sequence. The reason is 
that the number of partial solutions found must stay bounded. If the order of the 
input did not matter, there would be an exponential explosion in possible states
which we would not be able to store in memory. The traveling salesman problem, 
for example, has no inherent order in the vertices, and results in an exponential 
sized set of states. 

For example: Fibonacci numbers are defined by a simple two-term relation:
$f_n = f_{n-1} + f_{n-2}$, with $f_0 = 0$, $f_1 = 1$. They can be computed recursively,
but in exponential time because each recursive step branches into more recursive
steps: $f_n = f_{n-2} + 2f_{n-3} + f_{n-4}$ and so on. This requires no storage. A dynamic
programming algorithm, directly iterating over the definition, runs in linear time,
but stores the last two values computed as it runs. As another example, binomial
coefficients can be computed recursively

\[
C^n_m = C^{n-1}_{m-1} + C^{n-1}_m \tag{15.8}
\]

(with $C^n_m = 1$ if $m = 0$ or $n = m$) in exponential time, or dynamically from the
last computed row of Pascal’s triangle

                    1
                  1   1
                1   2   1
              1   3   3   1
            1   4   6   4   1
          1   5  10  10   5   1

in $n^2$ time, but using n integers to store the last computed row.
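
A short sketch makes the trade-off between the two approaches concrete. The function names are illustrative. The recursive version follows the definition literally, while the dynamic versions keep only the last two Fibonacci values or the last row of Pascal's triangle.

```python
def fib_recursive(n):
    """Direct recursion on the definition: exponential time, no stored table."""
    return n if n < 2 else fib_recursive(n - 1) + fib_recursive(n - 2)

def fib_dynamic(n):
    """Iterate over the definition, storing only the last two values."""
    prev, curr = 0, 1
    for _ in range(n):
        prev, curr = curr, prev + curr
    return prev

def binomial_row(n):
    """Row n of Pascal's triangle, built from the previous row only."""
    row = [1]
    for _ in range(n):
        row = [1] + [a + b for a, b in zip(row, row[1:])] + [1]
    return row

print(fib_recursive(10), fib_dynamic(10), binomial_row(5))
```

The recursive Fibonacci call becomes impractically slow for n of a few dozen, whereas the iterative version handles n in the thousands without difficulty.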

Dynamic programming is a standard technique in sequence analysis, for example: 
given two sequences of symbols, find the longest subsequence of symbols that 
appears in both the given sequences. This problem takes about mn steps, where m and n
are the lengths of the two sequences, and uses two m × n arrays to store the partial
results. From the perspective of biological modeling, a more interesting application 
of dynamic programming is for stochastic control. Model-free control theory may 
be particularly interesting as a way to make progress in predicting the response of 
a biological system in the absence of a complete model of its biochemistry. A recent 
review of applications of dynamic programming to stochastic control is Lee and Lee 
(2004). 
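
The longest common subsequence problem mentioned above can be sketched as follows. This minimal version is an assumption about one standard formulation: it stores a single (m+1) by (n+1) table of subsequence lengths and recovers the subsequence by tracing back through that table, rather than keeping a second array of back-pointers.

```python
def longest_common_subsequence(a, b):
    """Return one longest common subsequence of the strings a and b."""
    m, n = len(a), len(b)
    # length[i][j] = LCS length of the prefixes a[:i] and b[:j]
    length = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            if a[i - 1] == b[j - 1]:
                length[i][j] = length[i - 1][j - 1] + 1
            else:
                length[i][j] = max(length[i - 1][j], length[i][j - 1])
    # Trace back from the bottom-right corner to recover the symbols.
    out, i, j = [], m, n
    while i > 0 and j > 0:
        if a[i - 1] == b[j - 1]:
            out.append(a[i - 1])
            i -= 1
            j -= 1
        elif length[i - 1][j] >= length[i][j - 1]:
            i -= 1
        else:
            j -= 1
    return "".join(reversed(out))

print(longest_common_subsequence("AGGTAB", "GXTXAYB"))
```

The left-to-right ordering of the two sequences is what keeps the number of stored partial results bounded by the size of the table.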





15.5 Graphs 


A graph is, simply put, a set of relationships between objects. Networks of protein 
interactions are graphs, with the relationship being the evidence for interaction 
between two proteins, for example from a yeast-2-hybrid screen. Many complex
systems can be described in terms of the relationships between their parts. Hence 
graph theory appears prominently in many attempts to find common structures 
in large-scale data sets, for example expression array measurements, and is likely 
to become even more prominent as dynamical models are expressed as graphs of
interactions and molecular species. 

Several common graph problems are directly relevant to biology. For example, 
one might want to know if a graph of interactions remains connected if one deletes a 
certain set of interactions (edges in the graph) (Jeong et al., 2000). One might want 
to find the number of paths of length less than 5 connecting two given proteins. One 
might want to find the set of shortest paths that still connect all the proteins—this 
corresponds to a minimal spanning tree. In computer science, a graph with costs 
associated with the edges is termed a network. In biological examples, one might 
consider the −log(probability) of a given interaction being a true positive as the
cost associated with the interaction. Then one might want to consider the cheapest 
paths from protein A to protein B. 
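
As a sketch of the cheapest-path idea, the code below converts interaction probabilities to costs with −log and applies Dijkstra's algorithm, a standard shortest-path method that the text does not itself name. The protein labels and probabilities are invented for illustration, and interactions are treated as undirected.

```python
import heapq
import math

def cheapest_path(edges, source, target):
    """edges maps (u, v) -> probability in (0, 1]. Returns (path, total cost)."""
    graph = {}
    for (u, v), p in edges.items():
        cost = -math.log(p)                    # likely edges are cheap
        graph.setdefault(u, []).append((v, cost))
        graph.setdefault(v, []).append((u, cost))
    best = {source: 0.0}
    previous = {}
    heap = [(0.0, source)]
    while heap:
        cost, node = heapq.heappop(heap)
        if node == target:
            break
        if cost > best.get(node, math.inf):
            continue                           # stale heap entry
        for neighbor, step in graph.get(node, []):
            new_cost = cost + step
            if new_cost < best.get(neighbor, math.inf):
                best[neighbor] = new_cost
                previous[neighbor] = node
                heapq.heappush(heap, (new_cost, neighbor))
    if target not in best:
        return None, math.inf
    path = [target]
    while path[-1] != source:
        path.append(previous[path[-1]])
    return path[::-1], best[target]

# Hypothetical confidences for three interactions.
interactions = {("A", "B"): 0.5, ("A", "C"): 0.9, ("C", "B"): 0.9}
path, cost = cheapest_path(interactions, "A", "B")
print(path, math.exp(-cost))
```

Because costs add while probabilities multiply, the cheapest path is the one whose interactions are jointly most likely to be true positives, under the assumption that they are independent. Here the two-step route through C is preferred to the direct but less reliable edge.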

The solutions to many graph problems require the use of a class of algorithms 
termed greedy, in which a solution can be found by using only knowledge possessed 
at the time the next choice is made. A characteristic of these algorithms is that they 
are short-sighted, in other words, they take the step that seems intuitively to be the 
best one for the next step, but eventually their steps converge to the best solution. 
From a heuristic perspective this makes greedy algorithms intuitive, but it is often 
difficult to prove that they actually will lead to a correct solution without getting 
trapped in a sub-optimal solution. For example, finding the shortest path that goes 
through all the nodes in a graph (the traveling salesman problem) might intuitively 
require adding the nearest node to the path at any given step, but this will not 
usually lead to a solution of the problem. Greedy algorithms are particularly useful 
in problems with an exponentially or factorially growing search space, for example 
the number of possible interaction graphs for n proteins, or the graph of models 
obtained by elaborating or simplifying reactions in a dynamical model (chapter 4 
and chapter 11). Searching in the latter graph is an important problem in biological 
system identification. 

Examples of greedy algorithms are the algorithms that find the solution to the 
minimum spanning tree problem, the problem of finding the connected subgraph of 
least total edge cost such that every node in the graph is a node in the subgraph, 
and there are no cycles (closed loops) in the subgraph. For sparse graphs, which are 
graphs with the number of edges roughly equal to the number of nodes, the best 
algorithm runs in about n log log n steps. Are graphs of biological interest sparse? 
Scale-free graphs have the number of edges roughly proportional to the number of 
nodes, for example. 
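
A minimal sketch of one greedy spanning-tree method follows. It uses Kruskal's algorithm, chosen here for brevity rather than speed (the sort makes it slower than the best bound quoted above); the function name and the example edge list are hypothetical.

```python
def minimum_spanning_tree(nodes, weighted_edges):
    """Kruskal's greedy algorithm: repeatedly take the cheapest edge that
    does not close a cycle. `weighted_edges` is a list of (cost, u, v)."""
    parent = {node: node for node in nodes}

    def find(node):
        # Follow parent links to the representative of node's component,
        # shortening the chain as we go (path halving).
        while parent[node] != node:
            parent[node] = parent[parent[node]]
            node = parent[node]
        return node

    tree = []
    for cost, u, v in sorted(weighted_edges):
        root_u, root_v = find(u), find(v)
        if root_u != root_v:           # edge joins two separate components
            parent[root_u] = root_v
            tree.append((cost, u, v))
    return tree

edges = [(1.0, "A", "B"), (2.0, "B", "C"), (2.5, "A", "C"), (0.5, "C", "D")]
tree = minimum_spanning_tree("ABCD", edges)
total_cost = sum(cost for cost, _, _ in tree)
```

Each choice is made using only the edges seen so far, yet for this particular problem the greedy choice is known to give the optimal tree.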

Many uses for information present in a graph require a traversal of the graph. 
Such traversals are usually depth-first, that is, visit all nodes attached to a node 
attached to a starting node, before visiting all the nodes attached to a second node 
attached to the starting node and so on (as numbered below), 

          1
        /   \
       2     7
      / \   / \
     3   4 8   9
        / \
       5   6

or breadth-first (as numbered below), 

          1
        /   \
       2     3
      / \   / \
     4   5 6   7
        / \
       8   9


that is, visit all the nodes attached to the starting node first, then proceed outwards 
to the next-nearest neighbors and so on. Both these traversals of graphs take on the 
order of (number of nodes + number of edges) time steps, but different applications 
require different traversals. For example, traversing the graph of possible models 
mentioned above should probably be a breadth-first search, at least initially. 
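
The two traversal orders can be sketched as below. The nine-node tree and its letter labels are hypothetical stand-ins for the numbered diagrams; the resulting visit orders correspond to the depth-first and breadth-first numberings respectively.

```python
from collections import deque

def depth_first(graph, start):
    """Visit order when each branch is followed to its end before backtracking."""
    order, seen, stack = [], set(), [start]
    while stack:
        node = stack.pop()
        if node in seen:
            continue
        seen.add(node)
        order.append(node)
        # Push neighbors in reverse so the first-listed neighbor is visited first.
        stack.extend(reversed(graph[node]))
    return order

def breadth_first(graph, start):
    """Visit order when all nodes at distance k are visited before distance k + 1."""
    order, seen, queue = [start], {start}, deque([start])
    while queue:
        node = queue.popleft()
        for neighbor in graph[node]:
            if neighbor not in seen:
                seen.add(neighbor)
                order.append(neighbor)
                queue.append(neighbor)
    return order

tree = {
    "r": ["a", "b"],
    "a": ["c", "d"], "b": ["e", "f"],
    "c": [], "d": ["g", "h"], "e": [], "f": [],
    "g": [], "h": [],
}
dfs_order = depth_first(tree, "r")
bfs_order = breadth_first(tree, "r")
```

Both functions touch every node and every edge once, which is the (number of nodes + number of edges) cost mentioned above.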





15.6 Search 


Search algorithms depend on the search problem (Russell and Norvig, 2003). Searching 
amounts to locating an item in a set by probing elements in the set. In detail, it is 
a protean problem: the probe may be inaccurate (experimental uncertainty); probes 
may have unequal costs (for example, protein mass spectroscopy versus expression 
measurements); the search space may be infinite or very large (for example, parameter 
optimization for a large system of differential equations); the item to be located 
may not be uniquely identifiable (for example, there may be two dynamical models 
consistent with the data); resources for the search may be limited (for example, 
there may be a limited amount of extract available for expression measurements); 
or some combination of all these characteristics. Accordingly, there are many different 
types of search algorithms, most of which are variants or combinations of a 
few basic strategies. 

The simplest strategy might be termed generate-and-test. It is usually simple 
to implement, and it will clearly find the solution. It may, however, take a long 
time to find the solution in a problem with even a modest amount of complexity. 
Improvements on this algorithm include hill-climbing (adding a heuristic distance 
function to guide the generation of solutions that minimize the distance from the 
desired state) and adding stochasticity (avoiding entrapment in a local minimum 
of the distance function by adding some stochasticity to the distance function, and 
reducing the amplitude of the stochasticity as the search proceeds). 

An improved combination of depth-first search and breadth-first search is called 
best-first search. The search problem is formulated as a graph. Every node in the 
graph represents a model. A queue of nodes is set up. The current node is at the 
head of the queue. If the current node is not the solution of the search problem, 
all its nearest neighbor nodes are added to the queue, and the current node is 
removed from the queue. The queue is re-ordered according to a (possibly heuristic) 
distance function. The lowest node in the queue is made the current node and the 
algorithm repeats. Thus the search always tries the best move available from its 
list of possible moves, regardless of whether the move is a horizontal or vertical 
move. Only horizontal moves (breadth-first search) avoid getting trapped in dead 
ends but may end up searching all the nodes, while only vertical moves (depth-first 
search) avoid searching all the nodes but can get trapped in dead ends. Best-first 
search avoids the pitfalls of both breadth-first and depth-first searches by hopping 
around in the search graph between areas more likely to contain the desired model. 
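
A minimal sketch of best-first search follows, under some assumptions: a priority queue (heap) stands in for re-sorting the queue after every step, nodes are never queued twice, and the toy problem (reach 37 from 1 by adding one, subtracting one, or doubling) is hypothetical and merely stands in for a graph of models.

```python
import heapq

def best_first_search(start, is_goal, neighbors, distance):
    """Greedy best-first search: always expand the queued node that the
    (possibly heuristic) `distance` function ranks closest to the goal."""
    queue = [(distance(start), start)]
    came_from = {start: None}
    while queue:
        _, current = heapq.heappop(queue)     # lowest-distance node
        if is_goal(current):
            path = []
            while current is not None:
                path.append(current)
                current = came_from[current]
            return path[::-1]
        for candidate in neighbors(current):
            if candidate not in came_from:      # never queue a node twice
                came_from[candidate] = current
                heapq.heappush(queue, (distance(candidate), candidate))
    return None

goal = 37
path = best_first_search(
    start=1,
    is_goal=lambda n: n == goal,
    neighbors=lambda n: [n + 1, n - 1, 2 * n],
    distance=lambda n: abs(goal - n),
)
```

The search graph here is infinite, so neither a pure breadth-first nor a pure depth-first sweep is attractive; the distance function is what keeps the search near promising nodes. The path found is a valid route, not necessarily the shortest one.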

Perhaps the most popular search algorithm is A* search. A difference between 
best-first and A* search is that the heuristic function is the sum of two contributions. 
One is an underestimate of the distance from the current node to the goal of the 
search, and the other measures the distance from the current node to the putative 
next node. The heuristic behind this summation is that this sum (by the triangle 
inequality) is an estimate of the distance from the putative next node to the goal, the 
underestimation in the heuristic function compensating for the triangle inequality 
approximately. Having chosen a new current node, if the node chosen is not the 
goal, then this node and all its nearest neighbors are removed from the queue, 
another difference from best-first search. All these search algorithms have a worst-case 
performance proportional to the number of nodes in the graph, but A* search 
will out-perform the others by about an order of magnitude on typical problems. 
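
The sketch below uses the usual textbook form of A*, which is an assumption here and differs somewhat in bookkeeping from the description above: nodes are ranked by f = g + h, where g is the cost accumulated from the start and h is an estimate that never overestimates the remaining cost. The grid, the blocked cells and the Manhattan-distance heuristic are hypothetical.

```python
import heapq

def a_star(start, goal, neighbors, heuristic):
    """A* search: expand nodes in order of f = g + h, where g is the cost of
    the best known path from `start` and h = heuristic(node) must never
    overestimate the remaining cost to `goal`."""
    g = {start: 0}
    came_from = {start: None}
    queue = [(heuristic(start), start)]
    while queue:
        f, current = heapq.heappop(queue)
        if current == goal:
            path = []
            while current is not None:
                path.append(current)
                current = came_from[current]
            return path[::-1], g[goal]
        if f > g[current] + heuristic(current):
            continue                    # stale entry: a cheaper route was found
        for candidate, step_cost in neighbors(current):
            new_g = g[current] + step_cost
            if new_g < g.get(candidate, float("inf")):
                g[candidate] = new_g
                came_from[candidate] = current
                heapq.heappush(queue, (new_g + heuristic(candidate), candidate))
    return None, float("inf")

# Toy problem: a 5 x 5 grid in which two walls force a detour.
blocked = {(1, 1), (2, 1), (3, 1), (4, 1), (0, 3), (1, 3), (2, 3), (3, 3)}

def grid_neighbors(cell):
    x, y = cell
    for nx, ny in ((x + 1, y), (x - 1, y), (x, y + 1), (x, y - 1)):
        if 0 <= nx < 5 and 0 <= ny < 5 and (nx, ny) not in blocked:
            yield (nx, ny), 1

goal = (0, 4)
manhattan = lambda cell: abs(goal[0] - cell[0]) + abs(goal[1] - cell[1])
path, cost = a_star((0, 0), goal, grid_neighbors, manhattan)
```

Because the Manhattan distance never overestimates the number of remaining grid steps, the route returned is a shortest one, even though the walls make it three times longer than the straight-line estimate.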

For the iterative modeling that is needed in systems biology, another search 
heuristic is useful: means-ends analysis. This allows both backwards and forwards 
searching, so it is possible to iteratively refine models from gross features to more 
detailed features. It should be obvious that any application of this algorithm to 
modeling requires detailed biological input, so we give here only a rough sketch of 
the search strategy. The strategy examines the current state (the present dynamical 
model), the desired state (the present data), and the differences between them. The 
difference is used to iterate over adding additional interactions to the model that 
may bring the model closer to the desired predictions. If the data is organized in 
a hierarchical fashion, the resulting model will have a hierarchical structure. The 
model remains, of course, a model, in that the actual organization of biochemical 
interactions in the cell may not mirror the mathematical interactions incorporated 
into the model. 

In search problems with uncertainty, it is often interesting to pose the problem 
as a constraint satisfaction problem (chapter 5). Multiple alignments of DNA 
sequences are an example of constraint satisfaction problems, with biological knowledge 
guiding which alignments are good. Another example would be to apply constraint 
satisfaction heuristics to modeling the response of various related cell-lines 
to different stimuli: we do not want totally different models for each cell-line. We 
would like the models to be roughly “aligned,” just as in the sequence alignment 
problem, with biology dictating the meaning of a good alignment. For example, 
the timescales for corresponding parts of a particular response may be different 
between cell-lines, so models which reproduce this difference should be regarded as 
“aligned.” The constraints in this example might be modeled as statements of the 
form: “The time at which transcription factor X concentration rises precedes the 
time at which genes with upstream binding sites for X are expressed.” 





15.7 Identification, Simulation and Optimization of Dynamical Models 


Dynamical models of biological systems usually have unknown parameters such as 
reaction rates and initial concentrations for molecules not observed. The experimental 
data is used to constrain such unknown parameters. This process typically 
involves searching through different values of the unknown parameters, solving the 
dynamical system with a given set of values of these parameters, then adjusting the 
values of these parameters trying to improve the fit to experimental data. 
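
The simulate-compare-adjust loop can be sketched for the simplest possible case. Everything in this example is an illustrative assumption: the model is a single first-order decay dx/dt = -k x, the integrator is forward Euler, the "data" are synthetic and noise-free, and the adjustment rule is a repeated grid scan that zooms in on the best value of k.

```python
import math

def simulate_decay(k, x0, times, dt=0.001):
    """Integrate dx/dt = -k*x with forward Euler and return x at each requested time."""
    x, t, values = x0, 0.0, []
    for target in times:
        while t < target - 1e-12:
            step = min(dt, target - t)
            x += step * (-k * x)
            t += step
        values.append(x)
    return values

def cost(k, x0, times, data):
    """Sum of squared differences between simulated and measured values."""
    model = simulate_decay(k, x0, times)
    return sum((m - d) ** 2 for m, d in zip(model, data))

def fit_rate(x0, times, data, low=0.0, high=5.0, rounds=6, grid=11):
    """Repeatedly scan a grid of k values and zoom in around the best one."""
    for _ in range(rounds):
        candidates = [low + (high - low) * i / (grid - 1) for i in range(grid)]
        best = min(candidates, key=lambda k: cost(k, x0, times, data))
        width = (high - low) / (grid - 1)
        low, high = max(0.0, best - width), best + width
    return best

times = [0.5, 1.0, 1.5, 2.0]
data = [10.0 * math.exp(-0.8 * t) for t in times]   # synthetic measurements
k_fit = fit_rate(10.0, times, data)
```

Every evaluation of the cost requires a full solution of the dynamical system, which is why the number of parameter sets that must be tried dominates the computational burden.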

Dynamical systems come in many varieties, ranging from entirely stochastic descriptions 
of molecular dynamics (chapter 8 and chapter 16) to systems of ordinary 
differential equations expressing smooth variation (chapter 6 and chapter 12). If we 
have a system of m differential equations governing the dynamics of m molecular 
concentrations that we need to fit to T data points, with a certain time resolution, 
the number of computational steps required is roughly proportional to mT, 
with a proportionality constant depending on the accuracy required and the computational 
complexity of the system of equations. If every species interacts with 
every other species, the computational complexity of each equation may be of the 
order of m or higher, but if only a few molecules interact with any given molecule 
with simple Michaelis-Menten dynamics, the complexity will be independent of m. 

These two extremes reflect a major consideration in modeling. One has a choice: 
one can model a large number of molecular species with a few interactions per 
species, or one can model only a few marker species with complicated interactions 
between possibly all the other marker species. Given a good understanding of the 
fundamental interactions, it would appear from this discussion that the crossover 
point between the two approaches occurs at about (number of markers)² ~ (number 
of molecules). These are, however, not the only considerations. One almost never 
has detailed and accurate knowledge of the fundamental interactions of molecules 
in vivo so available data is used for finding likely values of reaction rates, which 
adds greatly to the computational burden. In addition, one usually cannot obtain 
many time-points for a large number of molecules due to resource constraints, 
which implies that the data used to constrain the reaction rates has low dynamical 
resolution. Another problem is that incomplete biochemical information may lead 
to finding a model that matches the data but only because important reactions 
are unknown—predictions for the effects of novel perturbations from such a model 
may be wrong. On the other hand, if one opts for using more accurately and more 
frequently sampled marker concentrations, one has to infer the whole complex of 
effective interactions, since many fundamental interactions are subsumed into the 
set of signals propagating from one marker molecule to another. This means that 
many more effective reaction rates need to be determined per marker molecule. 
Furthermore, one has to have enough biological knowledge to select the marker 
species, or factor in the cost of exploratory experiments to determine markers. 

From a larger perspective, modelers are interested in deducing dynamics from 
data (chapter 11). Experimental time series data is often noisy, reflecting both 
underlying stochasticity and experimental systematic variation. Noise reduction 
in time series is an extensively studied topic (Kostelich and Schreiber, 1993). 
A considerable part of the literature deals with systems characterized as low-dimensional 
chaos, which is not generally the case for biological systems. Rarely 
does a biologist have the resources to make measurements over many periods of the 
asymptotic behavior of a given system. Nevertheless, some techniques have broader 
applicability. 

One of the most flexible and robust techniques is a variant on the Takens 
time-delay-embedding method (Kostelich and Schreiber, 1993), usually called singular 
spectrum analysis (Golyandina et al., 2001). In this procedure, the available time 
series $x_i$, $i = 1, \ldots, N$ is used to construct a set of vectors $\mathbf{v}_i$, 
$i = 1, \ldots, N-L+1$, defined as 

$$\mathbf{v}_i = (x_i, x_{i+1}, \ldots, x_{i+L-1}) \tag{15.9}$$

The $\mathbf{v}$ vectors are used to construct a matrix $M$ of size $(N-L+1) \times L$ with the 
vectors serving as the rows of $M$. A Hankel matrix is a matrix with entries 
that are constant along anti-diagonals, so $M$ is an example of a Hankel matrix. 
A singular value decomposition of $M$ results in $M = Y_1 + Y_2 + \cdots$ with each $Y_i$ 
corresponding to a particular singular value. While this is an exact decomposition, 
the noise reduction is achieved by using only the $Y_i$ that correspond to the largest 
few singular values and add up to an approximately Hankel matrix. The final 
step is to add the chosen $Y_i$ and derive a noise-reduced time series by averaging 
over the anti-diagonal elements, thereby inverting the process by which $M$ was 
initially formed from the experimental time series $x_i$. The value of this technique 
is that it works without assuming anything about the underlying signal, unlike 
Fourier analysis or other noise filters based on specific special functions. The only 
assumption in singular spectrum analysis is the choice of the window length $L$. A 
rough rule of thumb is to pick a value of $L$ in a range where small changes in $L$ do 
not affect the noise-reduced time series, while keeping $N - L$ considerably larger 
than $L$. 
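
The first and last steps of this procedure, forming the matrix of lagged vectors and averaging over anti-diagonals, can be sketched in a few lines. This is a partial illustration only: the singular value decomposition and the selection of components in between are omitted and would normally come from a linear algebra library. The function names and the six-point series are hypothetical.

```python
def embed(series, window):
    """Stack the lagged vectors v_i = (x_i, ..., x_{i+L-1}) of equation 15.9
    as the rows of an (N - L + 1) x L matrix."""
    count = len(series) - window + 1
    return [list(series[i:i + window]) for i in range(count)]

def diagonal_average(matrix):
    """Average each anti-diagonal (entries with equal row + column index)
    to turn an approximately Hankel matrix back into a time series."""
    rows, cols = len(matrix), len(matrix[0])
    sums = [0.0] * (rows + cols - 1)
    counts = [0] * (rows + cols - 1)
    for i in range(rows):
        for j in range(cols):
            sums[i + j] += matrix[i][j]
            counts[i + j] += 1
    return [s / c for s, c in zip(sums, counts)]

series = [1.0, 2.0, 4.0, 3.0, 5.0, 6.0]
M = embed(series, window=3)          # 4 x 3 matrix, constant along anti-diagonals
recovered = diagonal_average(M)      # equals `series`, since M is exactly Hankel
```

Applied to a truncated sum of singular components instead of the full matrix, the same averaging step yields the noise-reduced series.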

Time series data has an advantage over a permutable data set, in that there is 
a definite time ordering to the data. How does one exploit the time ordering to 
help in the noise reduction? A simple approach, adapted from a procedure applied 
in chaotic systems (Kostelich and Schreiber, 1993), is applicable if the time series 
experiment has been repeated three times or more: We can use an interpolation of 
the form $x_n = a_{n,n-1} x_{n-1} + a_{n,n+1} x_{n+1} + b_n$ to find coefficients 
$a_{n,n-1}$, $a_{n,n+1}$, $b_n$ 
(for example, by a least-squares fit to the replicates) and use these coefficients to 
compute a noise-reduced value $\hat{x}_n$ for the time series. The procedure can even be 
carried out with only two samples in conjunction with a Bayesian marginalization 
(chapter 4) over the coefficients. This procedure reduces the noise in the time series 
for all save the first and last points. The caveats are that the interpolation may not 
be an accurate description of the dynamics, and there may be some sensitivity to 
outliers in the data if fewer repetitions of the data are available. 

Optimization of parameters is a search problem dealing with continuous families 
of parameters. As such, more methods are available to guide the search procedure. 
Optimization algorithms are broadly divided into searches that take derivatives of 
an objective cost function to find paths in the parameter space that minimize the 
cost, and searches that do not require derivatives of the cost function. Often in 
biological modeling, one does not know the character of the landscape associated 
with the cost function. If the cost varies smoothly with the parameter values, simple 
ideas like going down the path of steepest descent may find the minimum. More 
likely, the cost function has multiple local minima, and local optimization 
strategies such as steepest descent will get trapped in the vicinity of one of 
them. A more global approach to finding a minimum of the cost function uses 
simulated annealing, where the size of a step in parameter space that the algorithm 
takes as it searches for a global minimum is gradually reduced so that the algorithm 
avoids getting trapped in a wrong local minimum. Simulated annealing is much 
slower than gradient descent type methods but is much more likely to avoid false 
minima. Deterministic global optimization methods like branch and bound are 
computationally too demanding for most interesting biological modeling problems. 
Evolutionary strategies are search strategies that apply the biological example of 
evolution to searching by evolving a population consisting of different points in the 
parameter space according to how the points lower the cost function. A detailed 
description of the algorithms is beyond the scope of this chapter, but results by 
Moles et al. (2003) suggest that such strategies are the only ones capable of finding 
the true minimum in biological modeling. Since evolutionary strategies are generally 
the best suited to multimodal optimization problems, a possible implication is that 
cost function landscapes in biological systems may be multimodal. However, it is 
important to note that von Dassow et al. (2000) found that within the Drosophila 
stripe formation module, the parameter optimization problem was remarkably easy 
to solve, suggesting a quite different picture of the cost function landscape, provided 
that one has the right model. This suggests that model search should include ease of 
optimization as a criterion. A speculative hypothesis is that preferring a model that 
is easy to optimize is a computational analog of one kind of evolutionary pressure. 
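
A minimal sketch of simulated annealing on a one-parameter cost function follows. It is an illustration under stated assumptions rather than a recommended implementation: it uses the common Metropolis acceptance rule with a geometrically decreasing temperature, shrinks the step size at the same rate, and the multimodal test function `landscape` is hypothetical.

```python
import math
import random

def simulated_annealing(cost, start, step=1.0, temperature=1.0,
                        cooling=0.995, iterations=5000, seed=0):
    """Minimize a one-parameter cost function by a cooled random walk."""
    rng = random.Random(seed)
    current, current_cost = start, cost(start)
    best, best_cost = current, current_cost
    for _ in range(iterations):
        candidate = current + rng.gauss(0.0, step)
        candidate_cost = cost(candidate)
        delta = candidate_cost - current_cost
        # Always accept improvements; accept uphill moves with a probability
        # that shrinks as the temperature is lowered.
        if delta <= 0 or rng.random() < math.exp(-delta / temperature):
            current, current_cost = candidate, candidate_cost
            if current_cost < best_cost:
                best, best_cost = current, current_cost
        temperature *= cooling
        step = max(step * cooling, 1e-3)
    return best, best_cost

def landscape(k):
    """Hypothetical cost with several local minima."""
    return 0.1 * k * k + math.sin(3.0 * k)

best_k, best_cost = simulated_annealing(landscape, start=4.0)
```

Early on, the large steps and frequent uphill moves let the walk leave the basin it started in; as both are reduced, the search settles into whichever basin it currently occupies. The result depends on the random seed and on the cooling schedule, so repeated runs are advisable.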



15.8 Summary 


It may not be entirely obvious that a given problem of biological interest has 
a solution expressed as an algorithm studied before in an abstracted setting. 
Nevertheless, it is extremely important to make sure that there is no relevant 
abstract problem in computer science that has been studied prior to embarking 
on one’s own reinvention of the wheel. As the examples given in this chapter show, 
brute force large-scale computation to solve problems in systems biology is a stillborn 
endeavor. Even worse, ideas with conceptual merit may be dismissed as 
computationally impractical if one does not look for efficient algorithms suited to 
the task. 

An important consideration for modeling is the idea of relaxing the requirements 
in order to find approximate solutions in a reasonable amount of time. Given that 
a lot of biological information is uncertain or unavailable, probabilistic algorithms 
are a useful tool. These algorithms usually, though not always, find the correct 
answer, but they always give an answer quickly. A simple example of a probabilistic approach 
in system inference, for example, is to generate a variety of different models 
constrained by available knowledge, and without parameter optimization, just 
check the qualitative agreement between the data and the models. One can then 
do the computationally expensive step of parameter optimization for the best of 
the generated models, if needed. The point here is that we made no attempt to 
exhaustively enumerate and test all the possible models, given a set of hypotheses. 
We figuratively threw a bunch of models all at once at the experimental data and 
picked the model that came closest for further evaluation. We could be wrong, and 
this would be something we could check by repeatedly running this algorithm and 
comparing the common features of the selected models. A probabilistic algorithm is 
not always right, while a proven algorithm is not always fast. 

For the scale of problems of interest in systems biology, for the foreseeable future 
there will be no such thing as “unlimited computational power.” This does not 
imply that systems biology is impossible. It does imply that the results obtained 
by computer scientists in the past must be utilized just as much as known biology 
in order to make effective use of the large-scale data sets now becoming available. 



In chemical systems formed by living cells, the small numbers of molecules of a few 
reactant species can result in dynamical behavior that is discrete and stochastic, 
rather than continuous and deterministic (McAdams and Arkin, 1999, 1997; Arkin 
et al., 1998; Elowitz et al., 2002; Fedoroff and Fontana, 2002). By “discrete,” we 
mean the integer-valued nature of small molecular populations, which makes their 
representation by real-valued (continuous) variables inappropriate. By “stochastic,” 
we mean the random behavior that arises from the lack of total predictability in 
molecular dynamics. In this chapter we introduce some concepts and techniques 
that have been developed for mathematically describing and numerically simulating 
chemical systems that take proper account of discreteness and stochasticity. 
Throughout, we shall make the simplifying assumption that the system is well-stirred 
or spatially homogeneous. In practice this assumption is often justified, and 
it allows us to specify the state of the system simply by giving the molecular populations 
of the various chemical species. But in some circumstances the well-stirred 
assumption will not be justified; then the locations of the molecules and the dynamics 
of their movement must also be considered. Some approaches to this more 
computationally challenging situation are described in chapter 8. 





16.1 Chapter Overview 


We begin in section 16.2 by outlining the foundations of “stochastic chemical 
kinetics” and deriving the chemical master equation (CME), the time-evolution 
equation for the probability function of the system’s state. Unfortunately, the CME 
cannot be solved, either analytically or numerically, for any but the simplest of 
systems. But we can generate numerical realizations (sample trajectories in state 
space) of the stochastic process defined by the CME by using a Monte Carlo strategy 
called the stochastic simulation algorithm (SSA). The SSA is derived and discussed 
in section 16.3. Although the SSA is an ideal algorithm in the sense that it provides 
exact realizations of the CME, there is a computational price for this: Because the 
SSA simulates every reaction event, it will be painfully slow for systems that involve 
enormous numbers of such events, which most real chemical systems do. This has 
motivated a search for algorithms that give up some of the exactness of the SSA in 
return for greater simulation speed. 

One such approximate accelerated algorithm is known as tau-leaping, and it is 
described in section 16.4. In tau-leaping, instead of advancing the system to the 
time of the next reaction event, the system is advanced by a pre-selected time $\tau$, 
which typically encompasses more than one reaction event. The number of times 
each reaction fires in time $\tau$ is approximated by a Poisson random variable, and we 
explain why that can be done in section 16.4. In section 16.5 we show how, under 
certain conditions, tau-leaping further approximates to a stochastic differential 
equation called the chemical Langevin equation (CLE), and then how the CLE can 
in turn sometimes be approximated by an ordinary differential equation called the 
reaction rate equation (RRE). Tau-leaping, the CLE, and the RRE are successively 
coarser-grained approximations which usually become appropriate as the molecular 
populations of the reacting species are made larger and larger. 

In the past, virtually all chemically reacting systems were analyzed using the 
deterministic RRE, even though that equation is accurate only in the limit of 
infinitely large molecular populations. Near that limit though, the RRE practically 
always provides the most efficient description. One reason for this is the extensive 
theory that has been developed over the years for efficiently solving ordinary 
differential equations, especially those that are stiff. A stiff system of ordinary 
differential equations is one that involves processes occurring on vastly different 
time scales, the fastest of which is stable. Stiff RREs arise for chemical systems 
that contain a mixture of fast and slow reactions, and many if not most cellular 
systems are of this type. The practical consequence of stiffness is that, even though 
the system itself is stable, naive simulation techniques will be unstable unless they 
proceed in extremely small time steps. In section 16.6 we describe the problem 
of stiffness in a deterministic (RRE) context, along with its standard numerical 
resolution: implicit methods. 
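
The instability of naive methods on a stiff problem, and its resolution by an implicit method, can be seen in a deliberately small example. The sketch assumes a single fast, stable mode dy/dt = -rate · y as a stand-in for the fast part of a stiff RRE; the rate and step size are hypothetical.

```python
def explicit_euler(rate, y0, dt, steps):
    """Forward Euler for dy/dt = -rate * y: y_next = y + dt * (-rate * y)."""
    y = y0
    for _ in range(steps):
        y = y + dt * (-rate * y)
    return y

def implicit_euler(rate, y0, dt, steps):
    """Backward Euler: solve y_next = y + dt * (-rate * y_next) for y_next."""
    y = y0
    for _ in range(steps):
        y = y / (1.0 + rate * dt)
    return y

rate = 1000.0        # fast, stable mode; the exact solution decays to zero
dt = 0.01            # step sized for slower processes, so rate * dt = 10
unstable = explicit_euler(rate, 1.0, dt, steps=20)
stable = implicit_euler(rate, 1.0, dt, steps=20)
```

With rate · dt = 10, each explicit step multiplies the solution by -9, so the numerical solution grows without bound although the true solution decays. Each implicit step divides by 11, so the implicit solution decays for any step size.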

Given the connections described above between tau-leaping, the CLE, and the 
RRE, it should not be surprising that stiffness is also an issue for tau-leaping 
and the CLE. In section 16.7 we describe an implicit tau-leaping algorithm for 
stochastically simulating stiff chemical systems. We conclude in section 16.8 by 
describing and illustrating yet another promising algorithm for dealing with stiff 
stochastic chemical systems, which we call the slow-scale SSA. 





16.2 Foundations of Stochastic Chemical Kinetics and the Chemical Master Equation 


We consider a well-stirred system of molecules of $N$ chemical species $\{S_1, \ldots, S_N\}$ 
interacting through $M$ chemical reaction channels $\{R_1, \ldots, R_M\}$. The system is 
assumed to be confined to a constant volume $\Omega$, and to be in thermal (but 
not necessarily chemical) equilibrium at some constant temperature. With $X_i(t)$ 
denoting the number of molecules of species $S_i$ in the system at time $t$, we want to 
study the evolution of the state vector $\mathbf{X}(t) = (X_1(t), \ldots, X_N(t))$, given that the 
system was initially in some state $\mathbf{X}(t_0) = \mathbf{x}_0$. 

Each reaction channel $R_j$ is assumed to be “elemental” in the sense that it describes 
a distinct physical event which happens essentially instantaneously. Elemental 
reactions are either unimolecular or bimolecular; more complicated chemical 
reactions (including trimolecular reactions) are actually coupled sequences of two or 
more elemental reactions. 

Reaction channel $R_j$ is characterized mathematically by two quantities. The first 
is its state-change vector $\boldsymbol{\nu}_j = (\nu_{1j}, \ldots, \nu_{Nj})$, where $\nu_{ij}$ is defined to be the change 
in the $S_i$ molecular population caused by one $R_j$ reaction; thus, if the system is in 
state $\mathbf{x}$ and an $R_j$ reaction occurs, the system immediately jumps to state $\mathbf{x} + \boldsymbol{\nu}_j$. 
The array $\{\nu_{ij}\}$ is commonly known as the stoichiometric matrix. 

The other characterizing quantity for reaction channel $R_j$ is its propensity function 
$a_j$. It is defined so that $a_j(\mathbf{x})\,dt$ gives the probability, given $\mathbf{X}(t) = \mathbf{x}$, that 
one $R_j$ reaction will occur somewhere inside $\Omega$ in the next infinitesimal time interval 
$[t, t+dt)$. This probabilistic definition of the propensity function finds its 
justification in physical theory (Gillespie, 1992b,a). If $R_j$ is the unimolecular reaction 
$S_i \to$ products, the underlying physics is quantum mechanical, and implies 
the existence of some constant $c_j$ such that $a_j(\mathbf{x}) = c_j x_i$. If $R_j$ is the bimolecular 
reaction $S_i + S_{i'} \to$ products, the underlying physics implies a different constant 
$c_j$, and a propensity function $a_j(\mathbf{x})$ of the form $c_j x_i x_{i'}$ if $i \neq i'$, or 
$c_j \frac{1}{2} x_i (x_i - 1)$ if $i = i'$. The stochasticity of a bimolecular reaction stems from the fact that we do 
not know the precise positions and velocities of all the molecules in the system, so 
we can predict only the probability that an $S_i$ molecule and an $S_{i'}$ molecule will 
collide in the next $dt$ and then react according to $R_j$. 
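
The propensity forms just given translate directly into code. The helper names below are hypothetical; the formulas are those stated above, with the same-species case counting the number of distinct reactant pairs.

```python
def unimolecular_propensity(c, x_i):
    """a_j(x) = c_j * x_i for the reaction S_i -> products."""
    return c * x_i

def bimolecular_propensity(c, x_i, x_k=None):
    """a_j(x) for the reaction S_i + S_i' -> products.

    With two different species the number of reactant pairs is x_i * x_i'.
    With identical species (x_k omitted) it is x_i * (x_i - 1) / 2.
    """
    if x_k is None:
        return c * x_i * (x_i - 1) / 2
    return c * x_i * x_k

decay = unimolecular_propensity(2.0, 5)           # 2.0 * 5
binding = bimolecular_propensity(0.5, 4, 3)       # 0.5 * 4 * 3
dimerization = bimolecular_propensity(0.5, 4)     # 0.5 * (4 * 3 / 2)
lone_molecule = bimolecular_propensity(0.5, 1)    # a single molecule cannot dimerize
```

The last line shows why the same-species form uses $x_i(x_i - 1)$ rather than $x_i^2$: with only one molecule present the propensity must vanish.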

It turns out that $c_j$ for a unimolecular reaction is numerically equal to the reaction 
rate constant $k_j$ of conventional deterministic chemical kinetics, while $c_j$ for a 
bimolecular reaction is equal to $k_j/\Omega$ if the reactants are different species, or $2k_j/\Omega$ 
if they are the same (Gillespie, 1976, 1992b,a). But it would be wrong to infer from 
this that the propensity functions are simple heuristic extrapolations of the rates 
used in deterministic chemical kinetics; in fact, the inference flow actually goes the 
other way. The existence and forms of the propensity functions follow directly from 
molecular physics and kinetic theory, and not from deterministic chemical kinetics. 

The probabilistic nature of the dynamics described above implies that the most 
we can hope to compute is the probability $P(\mathbf{x}, t \mid \mathbf{x}_0, t_0)$ that $\mathbf{X}(t)$ will equal $\mathbf{x}$, 
given that $\mathbf{X}(t_0) = \mathbf{x}_0$. We can deduce a time-evolution equation for this function 
by using the laws of probability to write $P(\mathbf{x}, t+dt \mid \mathbf{x}_0, t_0)$ as: 

$$
P(\mathbf{x}, t+dt \mid \mathbf{x}_0, t_0) = P(\mathbf{x}, t \mid \mathbf{x}_0, t_0) \times \Big[1 - \sum_{j=1}^{M} a_j(\mathbf{x})\,dt\Big]
+ \sum_{j=1}^{M} P(\mathbf{x} - \boldsymbol{\nu}_j, t \mid \mathbf{x}_0, t_0) \times a_j(\mathbf{x} - \boldsymbol{\nu}_j)\,dt
$$


The first term on the right is the probability that the system is already in state 
$\mathbf{x}$ at time $t$, and no reaction of any kind occurs in $[t, t+dt)$. The generic second 
term is the probability that the system is one $R_j$ reaction removed from state $\mathbf{x}$ 
at time $t$, and one $R_j$ reaction occurs in $[t, t+dt)$. That these $M + 1$ routes from 
time $t$ to state $\mathbf{x}$ at time $t + dt$ are mutually exclusive and collectively exhaustive 
is ensured by taking $dt$ so small that no more than one reaction of any kind can 
occur in $[t, t+dt)$. Subtracting $P(\mathbf{x}, t \mid \mathbf{x}_0, t_0)$ from both sides, dividing through by 
$dt$, and taking the limit $dt \to 0$, we obtain (McQuarrie, 1967; Gillespie, 1992b) 

$$
\frac{\partial P(\mathbf{x}, t \mid \mathbf{x}_0, t_0)}{\partial t}
= \sum_{j=1}^{M} \big[ a_j(\mathbf{x} - \boldsymbol{\nu}_j)\, P(\mathbf{x} - \boldsymbol{\nu}_j, t \mid \mathbf{x}_0, t_0) - a_j(\mathbf{x})\, P(\mathbf{x}, t \mid \mathbf{x}_0, t_0) \big] \tag{16.1}
$$


This is the chemical master equation (CME). In principle, it completely determines 
the function $P(\mathbf{x}, t \mid \mathbf{x}_0, t_0)$. But the CME is really a set of nearly as many coupled 
ordinary differential equations as there are combinations of molecules that can exist 
in the system. So it is not surprising that the CME can be solved analytically 
for only a very few very simple systems, and numerical solutions are usually 
prohibitively difficult. 
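
As an illustration that is not part of the original text, consider one of the very simple systems for which the CME can be written out and solved: a single species undergoing the decay reaction $S \to \emptyset$, starting from $n_0$ molecules. Here the state is the integer $n$, the state change is $\nu = -1$, and the propensity is $a(n) = c\,n$, so equation 16.1 becomes a set of $n_0 + 1$ coupled ordinary differential equations.

```latex
% CME for S -> (nothing), with a(n) = c n and \nu = -1, so that x - \nu = n + 1:
\frac{\partial P(n, t \mid n_0, 0)}{\partial t}
  = c\,(n+1)\,P(n+1, t \mid n_0, 0) - c\,n\,P(n, t \mid n_0, 0),
  \qquad n = 0, 1, \ldots, n_0 .

% Its solution is a binomial distribution: each of the n_0 molecules
% independently survives to time t with probability e^{-ct}.
P(n, t \mid n_0, 0) = \binom{n_0}{n}\, e^{-c n t}\,\bigl(1 - e^{-c t}\bigr)^{n_0 - n} .
```

Even in this one-species case the number of equations grows with the initial population; with several interacting species the number of reachable states, and hence of equations, grows combinatorially.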

One might hope to learn something from the CME about the behavior of averages 
like $\langle f(\mathbf{X}(t)) \rangle = \sum_{\mathbf{x}} f(\mathbf{x})\, P(\mathbf{x}, t \mid \mathbf{x}_0, t_0)$, but this too turns out to pose difficulties 
if any of the reaction channels are bimolecular. For example, it can be proved from 
equation 16.1 that 

$$
\frac{d \langle X_i(t) \rangle}{dt} = \sum_{j=1}^{M} \nu_{ij}\, \langle a_j(\mathbf{X}(t)) \rangle \qquad (i = 1, \ldots, N)
$$


If all the reactions were unimolecular, the propensity functions would all be linear 
in the state variables, and we would have $\langle a_j(\mathbf{X}(t)) \rangle = a_j(\langle \mathbf{X}(t) \rangle)$. The above 
equation would then become a closed set of ordinary differential equations for the 
first moments, $\langle X_i(t) \rangle$. But if any reaction is bimolecular, the right hand side will 
contain at least one quadratic moment of the form $\langle X_i(t) X_{i'}(t) \rangle$, and the equation 
then becomes merely the first of an infinite, open-ended set of coupled equations for 
all the moments. 

In the hypothetical case that there are no fluctuations, we would have $\langle f(\mathbf{X}(t)) \rangle = 
f(\mathbf{X}(t))$ for all functions $f$. The above equation for $\langle X_i(t) \rangle$ would then reduce to 

$$
\frac{d X_i(t)}{dt} = \sum_{j=1}^{M} \nu_{ij}\, a_j(\mathbf{X}(t)) \qquad (i = 1, \ldots, N) \tag{16.2}
$$


This is the reaction rate equation (RRE) of traditional deterministic chemical 
kinetics: a set of $N$ coupled first-order ordinary differential equations for the $X_i(t)$, 
which are now continuous (real) variables. The RRE is more commonly written in 
terms of the concentration variables $X_i(t)/\Omega$, but that scalar transformation is 
inconsequential for our purposes here. Examples of RREs in a biological context 
abound in chapter 6. 

Although the deterministic RRE would evidently be valid in the absence of 
fluctuations, it is not clear what the justification and penalty might be for ignoring 
fluctuations. We shall later see how the RRE follows more deductively from a series 
of physically transparent approximating assumptions to the stochastic theory. 





16.3 The Stochastic Simulation Algorithm 


Since the CME (eq. 16.1) is rarely of much use in computing the probability density 
function $P(\mathbf{x}, t \mid \mathbf{x}_0, t_0)$ of $\mathbf{X}(t)$, we need another computational approach. One 
approach that has proven fruitful is to construct numerical realizations of $\mathbf{X}(t)$, 
that is, simulated trajectories of $\mathbf{X}(t)$ versus $t$. This is not the same as solving the 
CME numerically, as that would give us the probability density function of $\mathbf{X}(t)$ 
instead of samplings of that random variable. However, much the same effect can 
be achieved by either histogramming or averaging the results of many realizations. 
The key to generating simulated trajectories of $\mathbf{X}(t)$ is not the CME or even the 
function $P(\mathbf{x}, t \mid \mathbf{x}_0, t_0)$, but rather a new function, $p(\tau, j \mid \mathbf{x}, t)$ (Gillespie, 1976). It 
is defined so that $p(\tau, j \mid \mathbf{x}, t)\,d\tau$ is the probability, given $\mathbf{X}(t) = \mathbf{x}$, that the next 
reaction in the system will occur in the infinitesimal time interval $[t+\tau, t+\tau+d\tau)$, 
and will be an $R_j$ reaction. Formally, this function is the joint probability density 
function of the two random variables “time to the next reaction” ($\tau$) and “index of 
the next reaction” ($j$). 

To derive an analytical expression for $p(\tau, j \,|\, \mathbf{x}, t)$, we begin by noting that if
$P_0(\tau \,|\, \mathbf{x}, t)$ is the probability, given $\mathbf{X}(t) = \mathbf{x}$, that no reaction of any kind occurs
in the time interval $[t, t + \tau)$, then the laws of probability imply the relation

$$p(\tau, j \,|\, \mathbf{x}, t)\, d\tau = P_0(\tau \,|\, \mathbf{x}, t) \times a_j(\mathbf{x})\, d\tau$$

The laws of probability also imply

$$P_0(\tau + d\tau \,|\, \mathbf{x}, t) = P_0(\tau \,|\, \mathbf{x}, t) \times \Bigl[ 1 - \sum_{j'=1}^{M} a_{j'}(\mathbf{x})\, d\tau \Bigr]$$

An algebraic rearrangement of this last equation and passage to the limit $d\tau \to 0$
results in a differential equation whose solution is easily found to be
$P_0(\tau \,|\, \mathbf{x}, t) = \exp(-a_0(\mathbf{x})\, \tau)$, where

$$a_0(\mathbf{x}) = \sum_{j'=1}^{M} a_{j'}(\mathbf{x}) \tag{16.3}$$

When we insert this result into the equation for $p$, we get

$$p(\tau, j \,|\, \mathbf{x}, t) = a_j(\mathbf{x}) \exp(-a_0(\mathbf{x})\, \tau) \tag{16.4}$$


Equation 16.4 is the mathematical basis for the stochastic simulation approach. It
implies that the joint density function of $\tau$ and $j$ can be written as the product of the
$\tau$-density function, $a_0(\mathbf{x}) \exp(-a_0(\mathbf{x})\, \tau)$, and the $j$-density function, $a_j(\mathbf{x})/a_0(\mathbf{x})$.
We can generate random samples from these two density functions by using the
inversion method of Monte Carlo theory (Gillespie, 1992a). Draw two random
numbers $r_1$ and $r_2$ from the uniform distribution in the unit interval, and select
$\tau$ and $j$ according to

$$\tau = \frac{1}{a_0(\mathbf{x})} \ln\!\left( \frac{1}{r_1} \right) \tag{16.5a}$$

$$j = \text{the smallest integer satisfying } \sum_{j'=1}^{j} a_{j'}(\mathbf{x}) > r_2\, a_0(\mathbf{x}) \tag{16.5b}$$


Thus we arrive at the following version of the stochastic simulation algorithm (SSA)
(Gillespie, 1976, 1977):

1. Initialize the time $t = t_0$ and the system’s state $\mathbf{x} = \mathbf{x}_0$.

2. With the system in state $\mathbf{x}$ at time $t$, evaluate all the $a_j(\mathbf{x})$ and their sum $a_0(\mathbf{x})$.

3. Generate values for $\tau$ and $j$ according to equations 16.5a and b.

4. Effect the next reaction by replacing $t \leftarrow t + \tau$ and $\mathbf{x} \leftarrow \mathbf{x} + \boldsymbol{\nu}_j$.

5. Record $(\mathbf{x}, t)$ as desired. Return to step 2, or else end the simulation.
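
The SSA steps listed above can be sketched in a few lines of Python. This is a minimal illustration rather than a reference implementation: the function name `ssa`, its argument layout, the end time `t_end`, and the reversible isomerization test system with $c_1 = c_2 = 1$ are all assumptions made for the example.

```python
import itertools
import math
import random


def ssa(x0, nu, propensities, t0, t_end, rng):
    """Direct-method SSA; returns the visited (t, state) pairs up to t_end."""
    t, x = t0, list(x0)                                   # step 1
    trajectory = [(t, tuple(x))]
    while True:
        a = [a_j(x) for a_j in propensities]              # step 2: every a_j(x)
        cumulative = list(itertools.accumulate(a))        # partial sums of the a_j
        a0 = cumulative[-1]                               # their total a_0(x)
        if a0 <= 0.0:
            break                                         # no reaction can fire
        r1 = 1.0 - rng.random()                           # uniform on (0, 1]
        r2 = rng.random()                                 # uniform on [0, 1)
        tau = math.log(1.0 / r1) / a0                     # eq. 16.5a
        if t + tau > t_end:
            break
        # eq. 16.5b: smallest index whose partial sum exceeds r2 * a0
        # (the default only guards against floating-point round-off).
        j = next((i for i, c in enumerate(cumulative) if c > r2 * a0), len(a) - 1)
        t += tau                                          # step 4
        x = [x_i + d for x_i, d in zip(x, nu[j])]
        trajectory.append((t, tuple(x)))                  # step 5
    return trajectory


# Hypothetical example: reversible isomerization S1 <-> S2 with c1 = c2 = 1.
c1, c2 = 1.0, 1.0
nu = [(-1, +1), (+1, -1)]                                 # state-change vectors
propensities = [lambda x: c1 * x[0], lambda x: c2 * x[1]]
trajectory = ssa((100, 0), nu, propensities, 0.0, 5.0, random.Random(1))
print(len(trajectory) - 1, "reactions; final state:", trajectory[-1][1])
```

Histogramming or averaging the states from many such runs, each with a different seed, approximates the distribution that the CME describes.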


The $\mathbf{X}(t)$ trajectory that is produced by the SSA might be thought of as a
“stochastic version” of the trajectory that would be obtained by solving the RRE.
But note that the time step $\tau$ in the SSA is exact and is not a finite approximation
to some infinitesimal $dt$, as is the time step in most numerical solvers for the RRE.
If it is found that every SSA-generated trajectory is practically indistinguishable 
from the RRE trajectory, then we may conclude that microscale fluctuations are 
ignorable. But if the SSA trajectories deviate noticeably from the RRE trajectory, 
then we must conclude that microscale fluctuations are not ignorable, and the 
deterministic RRE does not provide an accurate description of the system’s real 
behavior. 


The SSA and the CME are logically equivalent to each other; yet even when the
CME is completely intractable, the SSA is quite straightforward to implement. The
problem with the SSA is that it is often very slow. The source of this slowness can
be traced to the factor $1/a_0(\mathbf{x})$ in equation 16.5a for $\tau$: $a_0(\mathbf{x})$ can be very large if
the population of one or more reactant species is large, and that is often the case
in practice.

There are variations on the above method for implementing the SSA that make
it more computationally efficient (Gibson and Bruck, 2000b; Cao et al., 2004a).
But any procedure that simulates every reaction event one at a time will inevitably 
be too slow for most practical applications. This prompts us to look for ways of 
giving up some of the exactness of the SSA in return for greater simulation speed. 





16.4 Tau-Leaping 


One approximate accelerated simulation strategy is tau-leaping (Gillespie, 2001).
It advances the system by a pre-selected time $\tau$ which encompasses more than one
reaction event. In its simplest form, tau-leaping requires that $\tau$ be chosen small
enough that the following leap condition is satisfied: The expected state change
induced by the leap must be sufficiently small that no propensity function changes
its value by a significant amount.

We recall that the Poisson random variable $\mathcal{P}(a, \tau)$ is by definition the number of
events that will occur in time $\tau$ given that $a\,dt$ is the probability that an event will
occur in any infinitesimal time $dt$, where $a$ can be any positive constant. Therefore,
if $\mathbf{X}(t) = \mathbf{x}$, and if we choose $\tau$ small enough to satisfy the leap condition, so that
the propensity functions stay approximately constant during the leap, then reaction
$R_j$ should fire approximately $\mathcal{P}_j(a_j(\mathbf{x}), \tau)$ times in $[t, t + \tau)$. Thus, to the degree
that the leap condition is satisfied, we can leap by a time $\tau$ simply by taking

$$\mathbf{X}(t + \tau) = \mathbf{x} + \sum_{j=1}^{M} \boldsymbol{\nu}_j\, \mathcal{P}_j(a_j(\mathbf{x}), \tau) \tag{16.6}$$


Doing this evidently requires generating $M$ Poisson random numbers for each leap
(Press et al., 1986). It will result in a faster simulation than the SSA to the degree
that the total number of reactions leapt over, $\sum_{j=1}^{M} \mathcal{P}_j(a_j(\mathbf{x}), \tau)$, is large compared
to $M$.

In order to use this simulation technique efficiently, we obviously need a way
to estimate the largest value of $\tau$ that is compatible with the leap condition.
One possible way of doing that (Gillespie and Petzold, 2003) is to estimate the
largest value of $\tau$ for which no propensity function is likely to change its value
during $\tau$ by more than $\varepsilon\, a_0(\mathbf{x})$, where $\varepsilon$ ($0 < \varepsilon < 1$) is some pre-chosen
accuracy-control parameter. Whatever the method of selecting $\tau$, the (explicit) tau-leaping
simulation procedure goes as follows (Gillespie, 2001; Gillespie and Petzold, 2003):

1. In state $\mathbf{x}$ at time $t$, choose a value for $\tau$ that satisfies the leap condition.

2. For each $j = 1, \ldots, M$, generate the number of firings $k_j$ of reaction $R_j$ in time
$\tau$ as a sample of the Poisson random variable $\mathcal{P}(a_j(\mathbf{x}), \tau)$.

3. Leap, by replacing $t \leftarrow t + \tau$ and $\mathbf{x} \leftarrow \mathbf{x} + \sum_{j=1}^{M} k_j \boldsymbol{\nu}_j$.
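
As a rough illustration of the three steps above, the following Python sketch performs explicit tau-leaping with a fixed, caller-supplied $\tau$. It makes several assumptions that are not part of the text: the helper names `poisson` and `tau_leap` are hypothetical, the leap condition is not checked (no $\tau$-selection procedure is implemented), nothing prevents populations from going negative if $\tau$ is too large, and the Poisson sampler is a simple one that is only practical for moderate means.

```python
import random


def poisson(rng, mean):
    """Poisson sample: count unit-rate exponential arrivals that land in [0, mean)."""
    count, elapsed = 0, rng.expovariate(1.0)
    while elapsed < mean:
        count += 1
        elapsed += rng.expovariate(1.0)
    return count


def tau_leap(x0, nu, propensities, tau, n_leaps, rng):
    """Explicit tau-leaping with a fixed tau; returns the state after each leap."""
    x = list(x0)
    states = [tuple(x)]
    for _ in range(n_leaps):
        # Step 2: every propensity is evaluated at the current state x.
        k = [poisson(rng, a_j(x) * tau) for a_j in propensities]
        # Step 3: x <- x + sum_j k_j * nu_j
        for k_j, nu_j in zip(k, nu):
            x = [x_i + k_j * d for x_i, d in zip(x, nu_j)]
        states.append(tuple(x))
    return states


# Hypothetical example: S1 <-> S2 with c1 = c2 = 1 and 1000 molecules in total.
nu = [(-1, +1), (+1, -1)]
propensities = [lambda x: 1.0 * x[0], lambda x: 1.0 * x[1]]
states = tau_leap((1000, 0), nu, propensities, 0.05, 100, random.Random(2))
print("state after 100 leaps of tau = 0.05:", states[-1])
```

Here $\tau = 0.05$ is about fifty times the initial SSA time scale $1/a_0(\mathbf{x}) = 0.001$, so each leap stands in for roughly fifty individual reaction events.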

In the limit that $\tau \to 0$, tau-leaping becomes mathematically equivalent to the
SSA. But tau-leaping also becomes very inefficient in that limit, because all the $k_j$’s
will approach zero, giving a very small time step with usually no reactions firing.
As a practical matter, tau-leaping should not be used if the largest value of $\tau$ that
satisfies the leap condition is less than a few multiples of $1/a_0(\mathbf{x})$, the expected
time to the next reaction in the SSA, since it would then be more efficient to use
the SSA.

Tau-leaping has been shown to significantly speed up the simulation of some 
systems (Gillespie, 2001; Gillespie and Petzold, 2003). But it is not as foolproof as 
the SSA. If one takes leaps that are too large, bad things can happen; for example, 
some species populations might be driven negative. If the system is stiff, meaning 
that it has widely varying dynamical modes with the fastest mode being stable, 
the leap condition will generally limit the size of $\tau$ to the time scale of the fastest
mode, with the result that large leaps cannot be taken. Stiffness is very common in 
cellular chemical systems and will be considered in more detail later. 

It is tempting to try to formulate a “higher-order” tau-leaping formula by extending
higher-order ODE methods in a straightforward manner for discrete stochastic
simulation. However, doing this correctly is much harder than it might at first appear.
Most such extensions are not even first-order accurate for the stochastic part
of the system. An analysis of the consistency, order, and convergence of tau-leaping
methods is given by Rathinam et al. (2005), where it is shown that the tau-leaping
method defined above, and the “implicit” tau-leaping method to be described in
section 16.7, are both first-order accurate as $\tau \to 0$.

Reaction Rate Equation


Suppose we can choose $\tau$ small enough to satisfy the leap condition, so that
approximation 16.6 is good, but nevertheless large enough that

$$a_j(\mathbf{x})\, \tau \gg 1 \quad \text{for all } j = 1, \ldots, M \tag{16.7}$$

Since $a_j(\mathbf{x})\tau$ is the mean of the random variable $\mathcal{P}_j(a_j(\mathbf{x}), \tau)$, the physical significance
of condition 16.7 is that each reaction channel is expected to fire many more
times than once in the next $\tau$. It will not always be possible to find a $\tau$ that satisfies
both the leap condition and condition 16.7, but it usually will be if the populations
of all the reactant species are sufficiently large.

When condition 16.7 does hold, we can make a useful approximation to the tau-leaping
formula 16.6. This approximation stems from the purely mathematical fact
that the Poisson random variable $\mathcal{P}(a, \tau)$, which has mean and variance $a\tau$, can be
well approximated when $a\tau \gg 1$ by a normal random variable with the same mean
and variance. Denoting the normal random variable with mean $m$ and variance $\sigma^2$
by $\mathcal{N}(m, \sigma^2)$, it thus follows that when condition 16.7 holds,

$$\mathcal{P}_j(a_j(\mathbf{x}), \tau) \approx \mathcal{N}_j(a_j(\mathbf{x})\tau,\, a_j(\mathbf{x})\tau) = a_j(\mathbf{x})\tau + (a_j(\mathbf{x})\tau)^{1/2}\, \mathcal{N}_j(0, 1)$$

the last step following from the fact that $\mathcal{N}(m, \sigma^2) = m + \sigma\, \mathcal{N}(0, 1)$. Inserting this
approximation into equation 16.6 gives (Gillespie, 2000, 2002)

$$\mathbf{X}(t + \tau) = \mathbf{x} + \sum_{j=1}^{M} \boldsymbol{\nu}_j a_j(\mathbf{x})\, \tau + \sum_{j=1}^{M} \boldsymbol{\nu}_j \sqrt{a_j(\mathbf{x})}\, \mathcal{N}_j(0, 1)\, \sqrt{\tau} \tag{16.8}$$

where the $\mathcal{N}_j(0, 1)$ are statistically independent normal random variables with
means 0 and variances 1. Equation 16.8 is called the Langevin leaping formula.
It evidently expresses the state increment $\mathbf{X}(t + \tau) - \mathbf{x}$ as the sum of two terms:
a deterministic drift term proportional to $\tau$, and a fluctuating diffusion term
proportional to $\sqrt{\tau}$. It is important to keep in mind that equation 16.8 is an
approximation, which is valid only to the extent that $\tau$ is (i) small enough that no
propensity function changes its value significantly during $\tau$, yet (ii) large enough
that every reaction fires many more times than once during $\tau$. The approximate
nature of equation 16.8 is underscored by the fact that $\mathbf{X}(t)$ therein is now a
continuous (real-valued) random variable instead of a discrete (integer-valued)
random variable; we lost discreteness when we replaced the integer-valued Poisson
random variable with a real-valued normal random variable. The Langevin leaping
formula 16.8 gives faster simulations than the tau-leaping formula 16.6 not only
because condition 16.7 implies that very many reactions get leapt over at each
step, but also because the normal random numbers that are required by equation
16.8 can be generated much more easily than the Poisson random numbers that are
required by equation 16.6 (Press et al., 1986).
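
A single Langevin leap (equation 16.8) might be coded as in the sketch below. The function name `langevin_leap`, the clamping of negative propensities to zero before taking the square root, and the example system are assumptions introduced for illustration. The sketch does not check the leap condition or condition 16.7.

```python
import math
import random


def langevin_leap(x, nu, propensities, tau, rng):
    """One step of eq. 16.8; returns a real-valued (no longer integer) state."""
    # Clamp at zero so the square root is defined (an assumption of this sketch).
    a = [max(a_j(x), 0.0) for a_j in propensities]
    new_x = list(x)
    for a_j, nu_j in zip(a, nu):
        # Firings of R_j: drift a_j*tau plus normal noise with variance a_j*tau.
        firings = a_j * tau + math.sqrt(a_j * tau) * rng.gauss(0.0, 1.0)
        new_x = [x_i + firings * d for x_i, d in zip(new_x, nu_j)]
    return new_x


# Hypothetical example: S1 <-> S2 with c1 = c2 = 1, started at equilibrium.
nu = [(-1, +1), (+1, -1)]
propensities = [lambda x: 1.0 * x[0], lambda x: 1.0 * x[1]]
rng = random.Random(3)
x = [1000.0, 1000.0]
for _ in range(50):
    x = langevin_leap(x, nu, propensities, 0.05, rng)   # a_j * tau is about 50 here
print("state after 50 Langevin leaps:", x)
```

Each leap needs only one normal random number per reaction channel, which is the source of the speed advantage over Poisson sampling noted above.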

The “small-but-large” character of $\tau$ in equation 16.8 marks that variable as
a “macroscopic infinitesimal.” If we subtract $\mathbf{x}$ from both sides and then divide
through by $\tau$, the result can be shown to be the following (approximate) stochastic
differential equation, which is called the chemical Langevin equation (CLE)
(Gillespie, 2000, 2002):

$$\frac{d\mathbf{X}(t)}{dt} = \sum_{j=1}^{M} \boldsymbol{\nu}_j a_j(\mathbf{X}(t)) + \sum_{j=1}^{M} \boldsymbol{\nu}_j \sqrt{a_j(\mathbf{X}(t))}\, \Gamma_j(t) \tag{16.9}$$

The $\Gamma_j(t)$ here are statistically independent “Gaussian white noise” processes
satisfying $\langle \Gamma_j(t)\, \Gamma_{j'}(t') \rangle = \delta_{jj'}\, \delta(t - t')$, where the first delta function is Kronecker’s
and the second is Dirac’s. The CLE (equation 16.9) is mathematically equivalent to
the Langevin leaping formula (equation 16.8), and is subject to the same conditions
for validity. Stochastic differential equations arise in many areas of physics, but the
usual way of obtaining them is to start with a macroscopically inspired drift term
(the first term on the right side of the CLE) and then assume a form for the diffusion
term (the second term on the right side of the CLE) with an eye to obtaining
some pre-conceived outcome. So it is noteworthy that our derivation here of the
CLE did not proceed in that ad hoc manner; instead, we used careful mathematical
approximations to infer the forms of both the drift and diffusion terms from the
premises underlying the CME/SSA.

Molecular systems become “macroscopic” in what physicists and chemists call the
thermodynamic limit. This limit is formally defined as follows: the system volume
$\Omega$ and the species populations $X_i$ all approach $\infty$, but in such a way that the
species concentrations $X_i/\Omega$ all remain constant. The large molecular populations in
chemical systems near the thermodynamic limit generally mean that such systems
will be well described by the Langevin formulas 16.8 and 16.9. To discern the
implications of those formulas in the thermodynamic limit, we evidently need to
know the behavior of the propensity functions in that limit. It turns out that all
propensity functions grow linearly with the system size as the thermodynamic limit
is approached. For a unimolecular propensity function of the form $c_j x_i$ this behavior
is obvious, since $c_j$ will be independent of the system size. For a bimolecular
propensity function of the form $c_j x_i x_{i'}$ this behavior is a consequence of the fact
that bimolecular $c_j$’s are always inversely proportional to $\Omega$, reflecting the fact that
two reactant molecules have a harder time finding each other in larger volumes.

It follows that, as the thermodynamic limit is approached, the deterministic drift 
term in equation 16.8 grows like the size of the system, while the fluctuating diffusion 
term grows like the square root of the size of the system, and likewise for the CLE. 
This establishes the well known rule-of-thumb in chemical kinetics that relative 
fluctuation effects in chemical systems typically scale as the inverse square root of 
the size of the system. 

In the full thermodynamic limit, the size of the second term on the right side of 
equation 16.9 will usually be negligibly small compared to the size of the first term, 
in which case the CLE reduces to the RRE. Thus we have derived the RRE as a 
series of limiting approximations to the stochastic theory that underlies the CME 
and the SSA. The tau-leaping and Langevin-leaping formulas evidently provide a 
conceptual bridge between stochastic chemical kinetics (the CME and SSA) and 
conventional deterministic chemical kinetics (the RRE), enabling us to see how the 
latter emerges as a limiting approximation of the former. 





16.6 Stiffness in Deterministic Reaction Rate Equations 


Stiffness can be defined roughly as the presence of widely varying time-scales in
a dynamical system, the fastest of which is stable. It poses special problems for
the numerical solution of both deterministic ordinary differential equations (ODEs)
and stochastic differential equations (SDEs), particularly in the context of chemical
kinetics. Stiffness also impacts both the SSA and the tau-leaping algorithm (equation
16.6). In this section we will describe the phenomenon of stiffness for deterministic
systems of ODEs, and show how it restricts the timestep size for all “explicit”
methods. Then we will show how the use of “implicit” methods overcomes this
restriction.

Consider the deterministic ODE system

$$\frac{d\mathbf{x}}{dt} = \mathbf{f}(t, \mathbf{x}) \tag{16.10}$$


In simplest terms, this system is said to be “stiff” if it has a strongly damped, or
“superstable” mode. To get a feeling for this concept, consider the solutions x(t) 
of an ODE system starting from various initial conditions. For a typical nonstiff 
system, if we plot a given component of the vector x-versus-t we might get a family 
of curves resembling those shown in figure 16.1a: The curves either remain roughly 
the same distance apart as t increases, as in the figure, or they might show a 
tendency to merge very slowly. But when such a family of curves is plotted for a 
typical stiff system, the result looks more like what is shown in figure 16.1b: The 
curves merge rapidly to one or more smoother curves, with the deviation from the 
smoother curves becoming very small as $t$ increases.

Figure 16.1 A system of ODEs is said to be “stiff” if its solutions show strongly damped
behavior as a function of the initial conditions. The family of curves shown in (a) represents 
the behavior of solutions to a nonstiff system for various initial conditions. In contrast, 
solutions to the stiff system shown in (b) tend to merge quickly. 


Stiffness in a system of ODEs corresponds to a strongly stable behavior of the 
physical system being modeled. At any given time the system will be in a sort 
of equilibrium, although not necessarily a static one, and if some state variable 
is perturbed slightly, the system will respond rapidly to restore the equilibrium. 
Typically, the true solution x(t) of the ODE system does not show any rapid 
variation, except possibly during an initial transient phase. But the potential for 
rapid response is always present and will manifest itself if we perturb x out of 
equilibrium. A stiff system has (at least) two time scales. There is a long (slow)
timescale for the quasi-equilibrium phase, and a short (fast) timescale for the
transient phase following a perturbation. The more different these two time scales 
are, the stiffer the system is said to be. 

The smallest (fastest) timescale in a stiff system manifests itself in another way
when we try to carry out a numerical solution of the system. Solution by an explicit
time stepping method, such as the simple explicit Euler method

$$\mathbf{x}_n = \mathbf{x}_{n-1} + \tau\, \mathbf{f}(t_{n-1}, \mathbf{x}_{n-1}) \tag{16.11}$$

where $t_n = t_{n-1} + \tau$ and $\mathbf{x}_n$ is the numerical approximation to $\mathbf{x}(t_n)$, will produce
very inaccurate results unless the time stepsize $\tau$ is kept smaller than the smallest
time scale in the system.

To see why this is so, let us consider a simple example: the reversible isomerization
reaction, $S_1 \underset{c_2}{\overset{c_1}{\rightleftharpoons}} S_2$. Let $x_T$ denote the (constant) total number of molecules of
the two isomeric species, and $x(t)$ the time-varying number of $S_1$ molecules. The
deterministic RRE for this system is the ODE

$$\frac{dx}{dt} = -c_1 x + c_2 (x_T - x) = -(c_1 + c_2)\, x + c_2 x_T \tag{16.12}$$

The solution to this ODE for the initial condition $x(0) = x_0$ is given by

$$x(t) = \frac{c_2 x_T}{c_1 + c_2} + \left( x_0 - \frac{c_2 x_T}{c_1 + c_2} \right) e^{-(c_1 + c_2)\, t}$$

From the form of this solution, we can see that if the initial value $x_0$ differs from the
asymptotic value $c_2 x_T / (c_1 + c_2)$, the solution will relax to that asymptotic value in a time
of order $(c_1 + c_2)^{-1}$; therefore, if $(c_1 + c_2)$ is very large, this system will be stiff. In
figure 16.2 we show the exact solution of the reversible isomerization reaction 16.12
along with numerical solutions obtained using the explicit Euler method (equation
16.11) with two different stepsizes $\tau$. Note that the smaller stepsize Euler solution
is accurate, but the larger stepsize solution is unstable.

To see why this instability arises, let us write down the explicit Euler method
(equation 16.11) with stepsize $\tau$ for the ODE (equation 16.12):

$$x_n = x_{n-1} - \tau (c_1 + c_2)\, x_{n-1} + \tau c_2 x_T \tag{16.13}$$

If we expand the true solution $x(t)$ in a Taylor series about $t_{n-1}$, we get

$$x(t_n) = x(t_{n-1}) - \tau (c_1 + c_2)\, x(t_{n-1}) + \tau c_2 x_T + O(\tau^2) \tag{16.14}$$

Subtracting 16.14 from 16.13, and defining the error $e_n = x_n - x(t_n)$, we obtain

$$e_n = e_{n-1} - \tau (c_1 + c_2)\, e_{n-1} + O(\tau^2) \tag{16.15}$$

Thus, $e_n$ is given by the recurrence formula

$$e_n = \bigl( 1 - \tau (c_1 + c_2) \bigr)\, e_{n-1} + O(\tau^2) \tag{16.16}$$

Figure 16.2 Exact solution of equation 16.12 (solid line) and its explicit Euler approximations
for stepsizes 0.2 (asterisks) and 1.1 (triangles) with $c_1 = c_2 = 1$ and $x_T$ = 2 × 10°.
The fast time constant for this problem is $(c_1 + c_2)^{-1} = 0.5$.





If $\tau > 2 (c_1 + c_2)^{-1}$, then $|1 - \tau (c_1 + c_2)|$ will be greater than 1, and we will
have $|e_n| > |e_{n-1}|$. The recurrence will then be unstable. In general, to ensure the
stability of an explicit method, we must restrict the stepsize to the timescale of the
fastest mode, even though much larger stepsizes might seem perfectly acceptable
for getting an adequate resolution of the solution curve.

The restriction of the explicit Euler method to timesteps $\tau$ that are on the order
of the short (fast) timescale makes the method very slow for stiff systems. So it is
natural to ask if there are other solution methods for which the timesteps are not
restricted by stability, but only by the need to resolve the solution curve. It is now
widely recognized that a general way of doing this is provided by implicit methods
(Ascher and Petzold, 1998), the simplest of which is the implicit Euler method. For
the ODE (equation 16.10), it reads

$$\mathbf{x}_n = \mathbf{x}_{n-1} + \tau\, \mathbf{f}(t_n, \mathbf{x}_n) \tag{16.17}$$

In contrast to the explicit Euler formula (equation 16.11), this method is implicit
because $\mathbf{x}_n$ is not defined entirely in terms of past values of the solution; instead, it
is defined implicitly as the solution of the (possibly nonlinear) equation 16.17. We
can write this system abstractly as

$$\mathbf{F}(\mathbf{u}) = 0 \tag{16.18}$$

where $\mathbf{u} = \mathbf{x}_n$ and $\mathbf{F}(\mathbf{u}) = \mathbf{u} - \mathbf{x}_{n-1} - \tau\, \mathbf{f}(t_n, \mathbf{u})$. Usually, the most efficient way to
numerically solve equation 16.18 is by Newton iteration: One iterates the formula

$$\frac{\partial \mathbf{F}}{\partial \mathbf{u}} \bigl[ \mathbf{u}^{(m+1)} - \mathbf{u}^{(m)} \bigr] = -\mathbf{F}(\mathbf{u}^{(m)}) \tag{16.19}$$

over $m$, where $\mathbf{u}^{(m)}$ is the $m$th iterated approximation to the exact root of $\mathbf{F}$, and the
Jacobian matrix $\partial \mathbf{F} / \partial \mathbf{u}$ is evaluated at $\mathbf{u}^{(m)}$. This is a linear system of equations,
which is to be solved at each iteration for $\mathbf{u}^{(m+1)}$. Newton’s method converges in one
iteration for linear systems, and the convergence is quite rapid for most nonlinear
systems given a good initial guess. The initial guess is usually obtained by evaluating
a polynomial that coincides with recent past solution values at $t_n$. In practice,
the Jacobian matrix is usually not reevaluated at each iteration; also, it is often
approximated by numerical difference quotients rather than evaluated exactly. The
use of an approximate Jacobian matrix that is fixed throughout the iteration is
called modified Newton iteration. On first glance, it might seem that the expense
of solving the nonlinear system at each time step would outweigh the advantage
of increased stability; however, this is usually not so. For stiff systems, implicit
methods are usually able to take timesteps that are so much larger than those of
explicit methods that the implicit methods wind up being much more efficient.
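
For a scalar ODE, the Newton iteration of equation 16.19 reduces to a division, which makes the idea easy to sketch in Python. Everything named here is an assumption made for illustration: the function `implicit_euler_step`, the use of the previous solution value as the initial guess (a simplification of the polynomial predictor mentioned above), the exact derivative supplied by the caller, and the hypothetical test equation $dx/dt = -c\,x^2$.

```python
def implicit_euler_step(f, dfdx, t_next, x_prev, tau, tol=1e-12, max_iter=50):
    """Solve u - x_prev - tau*f(t_next, u) = 0 for u by Newton iteration (scalar case)."""
    u = x_prev                                        # simple initial guess
    for _ in range(max_iter):
        residual = u - x_prev - tau * f(t_next, u)    # F(u) of eq. 16.18
        jacobian = 1.0 - tau * dfdx(t_next, u)        # dF/du evaluated at u
        delta = -residual / jacobian                  # eq. 16.19 in one dimension
        u += delta
        if abs(delta) <= tol * max(1.0, abs(u)):
            return u
    raise RuntimeError("Newton iteration did not converge")


# Hypothetical nonlinear example: dx/dt = -c * x**2 with c = 0.01.
c = 0.01
x_next = implicit_euler_step(lambda t, x: -c * x * x,
                             lambda t, x: -2.0 * c * x,
                             t_next=1.0, x_prev=100.0, tau=1.0)
print("implicit Euler step from x = 100 with tau = 1:", x_next)
```

For a linear right-hand side such as equation 16.12, the same routine reaches the exact solution of the step after the first Newton update, consistent with the remark that Newton’s method converges in one iteration for linear systems.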

To see why the implicit Euler method does not need to restrict the step size to
maintain stability for stiff systems, let us consider again the reversible isomerization
reaction (equation 16.12). For it, the implicit Euler method reads (cf. equation
16.13)

$$x_n = x_{n-1} - \tau (c_1 + c_2)\, x_n + \tau c_2 x_T \tag{16.20}$$

Expanding the true solution in a Taylor series about $t_n$, we get (cf. equation 16.14)

$$x(t_n) = x(t_{n-1}) - \tau (c_1 + c_2)\, x(t_n) + \tau c_2 x_T + O(\tau^2) \tag{16.21}$$

Subtracting 16.21 from 16.20, we find that the error $e_n = x_n - x(t_n)$ now satisfies
(cf. equation 16.15)

$$e_n = e_{n-1} - \tau (c_1 + c_2)\, e_n + O(\tau^2) \tag{16.22}$$

Solving this for $e_n$, we get

$$e_n = \frac{e_{n-1}}{1 + \tau (c_1 + c_2)} + O(\tau^2) \tag{16.23}$$

In contrast to the error eq. 16.16 for the explicit Euler method, the error for the
implicit Euler method remains small for arbitrarily large values of $\tau (c_1 + c_2)$, as
seen in figure 16.3.
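
The contrast between the two recurrences can be checked numerically with a short Python sketch applied to the isomerization RRE (equation 16.12). The function names and the value $x_T = 2000$ are assumptions chosen for the example; the rate constants $c_1 = c_2 = 1$ and the stepsizes 0.2 and 1.1 follow the figure captions.

```python
def explicit_euler(x0, c1, c2, x_total, tau, n_steps):
    """Iterate eq. 16.13: x_n = x_{n-1} + tau * (-(c1 + c2) * x_{n-1} + c2 * x_total)."""
    xs = [x0]
    for _ in range(n_steps):
        x = xs[-1]
        xs.append(x + tau * (-(c1 + c2) * x + c2 * x_total))
    return xs


def implicit_euler(x0, c1, c2, x_total, tau, n_steps):
    """Iterate eq. 16.20, which is linear in x_n and so can be solved in closed form."""
    xs = [x0]
    for _ in range(n_steps):
        xs.append((xs[-1] + tau * c2 * x_total) / (1.0 + tau * (c1 + c2)))
    return xs


c1, c2, x_total = 1.0, 1.0, 2000.0                 # x_total is a hypothetical value
x_inf = c2 * x_total / (c1 + c2)                   # asymptotic value of x(t)
for tau in (0.2, 1.1):
    e_explicit = explicit_euler(x_total, c1, c2, x_total, tau, 20)[-1] - x_inf
    e_implicit = implicit_euler(x_total, c1, c2, x_total, tau, 20)[-1] - x_inf
    print(f"tau = {tau}: explicit deviation {e_explicit:.3g}, implicit deviation {e_implicit:.3g}")
```

With $\tau = 1.1 > 2 (c_1 + c_2)^{-1} = 1$, the explicit deviation from the asymptotic value is multiplied by $1 - 2.2 = -1.2$ at every step and grows in magnitude, while the implicit deviation is multiplied by $1/3.2$ and decays.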

For the general ODE system (eq. 16.10), the negatives of the eigenvalues of the matrix
$\mathbf{J} = \partial \mathbf{f} / \partial \mathbf{x}$ play the role of $(c_1 + c_2)$. For stiff systems, the eigenvalues $\lambda$ of $\mathbf{J}$ will
include at least one with a relatively large negative real part, corresponding in the
case of an RRE to the fastest reactions. The set of complex numbers $\tau\lambda$ satisfying
$|1 + \tau\lambda| < 1$ is called the region of absolute stability of the explicit Euler method.
The corresponding region for the implicit Euler method is given by $1/|1 - \tau\lambda| < 1$,
and it will be much larger.

A great deal of work has gone into the numerical solution of stiff systems of
ODEs (and of ODEs coupled with nonlinear constraints, called differential algebraic
equations (DAEs)). There is extensive theory and highly efficient and reliable
software which adapts both the method order and the timestep to the given
problem. See Ascher and Petzold (1998) for more details.

Figure 16.3 The implicit Euler method overcomes a weakness of the explicit Euler
method in that it does not need to restrict the step size to provide stable solutions for stiff
systems. The figure shows the true solution of the deterministic reversible isomerization
reaction 16.12 (solid line), and the numerical solution by the implicit Euler method for
stepsizes 0.2 (asterisks) and 1.1 (triangles) with $c_1 = c_2 = 1$ and $x_T$ = 2 × 10°. Note
the excellent agreement, in contrast to the case with the explicit Euler method shown in
figure 16.2.





16.7 Stiffness in Stochastic Chemical Kinetics: The Implicit Tau-Leaping Method 


When stochasticity is introduced into a chemical system that has fast and slow 
time scales, with the fast mode being stable as before, we may still expect there 
to be a slow manifold corresponding to the equilibrium of the fast reactions. But 
stochasticity changes the picture in a fundamental way: once the system reaches 
the slow manifold, naturally occurring fluctuations will drive it back off, leading to 
persistent random fluctuations transverse to the slow manifold. If these fluctuations 
are negligibly small, then an implicit scheme which takes large steps (on the time 
scale of the slow mode) will do just fine. But if the fluctuations off the slow 
manifold are noticeable, then an implicit scheme that takes steps much larger than 
the time scale of the fast dynamics will dampen the fluctuations, and thus fail to 
reproduce them correctly. Fortunately, this failing can usually be corrected by using 
a procedure called down-shifting, which we will describe shortly. 

The original tau-leaping method (equation 16.6) is explicit because the propensity
functions $a_j$ are evaluated at the current (known) state, so the future (unknown)
random state $\mathbf{X}(t + \tau)$ is given as an explicit function of $\mathbf{X}(t)$. It is this explicit
nature of equation 16.6 that leads to stability problems when stiffness is present,
just as with ordinary differential equations. One way of making the explicit tau-leaping
formula 16.6 implicit is to modify it as follows (Rathinam et al., 2003):

$$\mathbf{X}(t + \tau) = \mathbf{X}(t) + \sum_{j=1}^{M} \boldsymbol{\nu}_j a_j(\mathbf{X}(t + \tau))\, \tau + \sum_{j=1}^{M} \boldsymbol{\nu}_j \bigl[ \mathcal{P}_j(a_j(\mathbf{X}(t)), \tau) - a_j(\mathbf{X}(t))\, \tau \bigr] \tag{16.24}$$

Since the random variables $\mathcal{P}_j(a_j(\mathbf{X}(t)), \tau)$ here can be generated without knowing
$\mathbf{X}(t + \tau)$, then once values for those random variables are set, equation 16.24 becomes
an ordinary implicit equation for the unknown state $\mathbf{X}(t + \tau)$, and $\mathbf{X}(t + \tau)$ can
then be found by applying Newton iteration to equation 16.24.
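
For the reversible isomerization reaction, equation 16.24 is linear in the unknown state, so the implicit step can be solved directly without iterating. The Python sketch below illustrates this under stated assumptions: the names `poisson` and `implicit_tau_leap` are hypothetical, the state is tracked as the single number $x$ of $S_1$ molecules with $x_T$ fixed, and rounding the result to the nearest integer and clamping it to $[0, x_T]$ are choices made for this example rather than steps prescribed by the text.

```python
import random


def poisson(rng, mean):
    """Poisson sample: count unit-rate exponential arrivals that land in [0, mean)."""
    count, elapsed = 0, rng.expovariate(1.0)
    while elapsed < mean:
        count += 1
        elapsed += rng.expovariate(1.0)
    return count


def implicit_tau_leap(x, c1, c2, x_total, tau, rng):
    """One implicit tau-leap (eq. 16.24) for S1 <-> S2; x is the S1 population."""
    a1, a2 = c1 * x, c2 * (x_total - x)             # propensities at the known state
    p1, p2 = poisson(rng, a1 * tau), poisson(rng, a2 * tau)
    # With nu_1 = -1 and nu_2 = +1, eq. 16.24 reads
    #   x_new = x + tau*(-c1*x_new + c2*(x_total - x_new)) - (p1 - a1*tau) + (p2 - a2*tau),
    # which is linear in x_new and can be solved in closed form.
    rhs = x + tau * c2 * x_total - (p1 - a1 * tau) + (p2 - a2 * tau)
    x_new = rhs / (1.0 + tau * (c1 + c2))
    return min(max(round(x_new), 0), x_total)       # round and clamp: assumptions


# Hypothetical run with tau = 1.1, beyond the explicit stability limit 2/(c1 + c2) = 1.
rng = random.Random(4)
x, x_total = 2000, 2000
for _ in range(50):
    x = implicit_tau_leap(x, 1.0, 1.0, x_total, 1.1, rng)
print("S1 population after 50 implicit leaps:", x)
```

The population settles near the asymptotic value $c_2 x_T / (c_1 + c_2) = 1000$ even though $\tau$ exceeds the explicit stability limit.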

Just as the explicit tau method segues to the explicit Euler methods for SDEs
and ODEs, the implicit tau method segues to the implicit Euler methods for SDEs
and ODEs. In the SDE regime we get, approximating Poisson random variables
by normal random variables, the implicit version of the Langevin leaping formula:

$$\mathbf{X}(t + \tau) = \mathbf{X}(t) + \tau \sum_{j=1}^{M} \boldsymbol{\nu}_j a_j(\mathbf{X}(t + \tau)) + \sum_{j=1}^{M} \boldsymbol{\nu}_j \sqrt{a_j(\mathbf{X}(t))}\, \mathcal{N}_j(0, 1)\, \sqrt{\tau} \tag{16.25}$$

Here, the $\mathcal{N}_j(0, 1)$ are, as in eq. 16.8, independent normal random variables with
mean zero and variance 1. And in the thermodynamic limit, where the random
terms in eq. 16.25 may be ignored, it reduces to the implicit Euler method

$$\mathbf{X}(t + \tau) = \mathbf{X}(t) + \tau \sum_{j=1}^{M} \boldsymbol{\nu}_j a_j(\mathbf{X}(t + \tau)) \tag{16.26}$$

for the deterministic RRE.

We noted earlier that the implicit tau method, when used with a relatively large 
timestep, will dampen the natural fluctuations of the fast variables. Thus, although 
the implicit tau-leaping method computes the slow variables with their correct 
distributions, it computes the fast variables with the correct means but with spreads 
about those means that are too narrow. Fortunately, a time-stepping strategy called 
down-shifting can restore the overly-damped fluctuations in the fast variables. The 
idea is to interlace the implicit tau-leaps, each of which is on the order of the 
time scale of the slow variables and hence “large,” with a sequence of much smaller 
time steps on the time scale of the fast variables, these being taken using either 
the explicit tau method or the SSA. This sequence of smaller steps “regenerates” 
the correct statistical distributions of the fast variables. Further details on implicit 
tau-leaping and down-shifting can be found in Rathinam et al. (2003). 







16.8 Stiffness in Stochastic Chemical Kinetics: The Slow-Scale SSA 


Another way to deal with stiffness in stochastic systems is to use the recently
developed (Cao et al., 2005) slow-scale SSA (ssSSA). The first step in setting up the
ssSSA is to divide (and reindex) the $M$ reaction channels $\{R_1, \ldots, R_M\}$ into fast and
slow subsets, $\{R_1^{\mathrm{f}}, \ldots, R_{M_{\mathrm{f}}}^{\mathrm{f}}\}$ and $\{R_1^{\mathrm{s}}, \ldots, R_{M_{\mathrm{s}}}^{\mathrm{s}}\}$, where $M_{\mathrm{f}} + M_{\mathrm{s}} = M$. We initially
do this provisionally (subject to possible later change) according to the following
criterion: the propensity functions of the fast reactions, $a_1^{\mathrm{f}}, \ldots, a_{M_{\mathrm{f}}}^{\mathrm{f}}$, should usually
be very much larger than the propensity functions of the slow reactions, $a_1^{\mathrm{s}}, \ldots, a_{M_{\mathrm{s}}}^{\mathrm{s}}$.
The broad result of this partitioning will be that the time to the occurrence of the
next fast reaction will usually be very much smaller than the time to the occurrence
of the next slow reaction.

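Next we divide (and reindex) the $N$ species $\{S_1, \ldots, S_N\}$ into fast and slow
subsets, $\{S_1^{\mathrm{f}}, \ldots, S_{N_{\mathrm{f}}}^{\mathrm{f}}\}$ and $\{S_1^{\mathrm{s}}, \ldots, S_{N_{\mathrm{s}}}^{\mathrm{s}}\}$, where $N_{\mathrm{f}} + N_{\mathrm{s}} = N$. This gives rise to
a like partitioning of the state vector $\mathbf{X}(t) = (\mathbf{X}^{\mathrm{f}}(t), \mathbf{X}^{\mathrm{s}}(t))$, and also the generic
state space variable $\mathbf{x} = (\mathbf{x}^{\mathrm{f}}, \mathbf{x}^{\mathrm{s}})$, into fast and slow components. The criterion for
making this partitioning is simple: a fast species is any species whose population
gets changed by some fast reaction; all the other species are called slow. Note the
asymmetry in this definition: a slow species cannot get changed by a fast reaction,
but a fast species can get changed by a slow reaction. Note also that $a_j^{\mathrm{f}}$ and $a_j^{\mathrm{s}}$ can
both depend on both fast and slow variables. The state-change vectors can now be
re-indexed

$$\boldsymbol{\nu}_j^{\mathrm{f}} = \bigl( \nu_{1j}^{\mathrm{ff}}, \ldots, \nu_{N_{\mathrm{f}} j}^{\mathrm{ff}} \bigr), \quad j = 1, \ldots, M_{\mathrm{f}}$$

$$\boldsymbol{\nu}_j^{\mathrm{s}} = \bigl( \nu_{1j}^{\mathrm{fs}}, \ldots, \nu_{N_{\mathrm{f}} j}^{\mathrm{fs}}, \nu_{1j}^{\mathrm{ss}}, \ldots, \nu_{N_{\mathrm{s}} j}^{\mathrm{ss}} \bigr), \quad j = 1, \ldots, M_{\mathrm{s}}$$

where $\nu_{ij}^{\sigma\rho}$ denotes the change in the number of molecules of species $S_i^{\sigma}$ ($\sigma = \mathrm{f}, \mathrm{s}$)
induced by one reaction $R_j^{\rho}$ ($\rho = \mathrm{f}, \mathrm{s}$). We can regard $\boldsymbol{\nu}_j^{\mathrm{f}}$ as a vector with the same
dimensionality ($N_{\mathrm{f}}$) as $\mathbf{X}^{\mathrm{f}}$, because $\nu_{ij}^{\mathrm{sf}} = 0$ (slow species do not get changed by fast
reactions).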

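The next step in setting up the ssSSA is to introduce the virtual fast process
$\hat{\mathbf{X}}^{\mathrm{f}}(t)$. It is composed of the same fast species state variables as the real fast process
$\mathbf{X}^{\mathrm{f}}(t)$, but it evolves only through the fast reactions; that is, $\hat{\mathbf{X}}^{\mathrm{f}}(t)$ is $\mathbf{X}^{\mathrm{f}}(t)$ with all
the slow reactions switched off. To the extent that the slow reactions don’t occur
very often, we may expect $\hat{\mathbf{X}}^{\mathrm{f}}(t)$ and $\mathbf{X}^{\mathrm{f}}(t)$ to be very similar to each other. But
from a mathematical standpoint there is a profound difference: $\mathbf{X}^{\mathrm{f}}(t)$ by itself
is not a Markov (past-forgetting) process, whereas $\hat{\mathbf{X}}^{\mathrm{f}}(t)$ is. Since the evolution
of $\mathbf{X}^{\mathrm{f}}(t)$ depends on the evolving slow process $\mathbf{X}^{\mathrm{s}}(t)$, $\mathbf{X}^{\mathrm{f}}(t)$ is not governed by a
master equation of the simple Markovian form (equation 16.1); indeed, the easiest
way to find $\mathbf{X}^{\mathrm{f}}(t)$ would be to solve the Markovian master equation for the full
process $\mathbf{X}(t) = (\mathbf{X}^{\mathrm{f}}(t), \mathbf{X}^{\mathrm{s}}(t))$, which is something we have tacitly assumed cannot
be done. But for the virtual fast process $\hat{\mathbf{X}}^{\mathrm{f}}(t)$, the slow process $\mathbf{X}^{\mathrm{s}}(t)$ stays fixed at
some constant initial value $\mathbf{x}_0^{\mathrm{s}}$; therefore, $\hat{\mathbf{X}}^{\mathrm{f}}(t)$ evolves according to the Markovian
master equation,

$$\frac{\partial \hat{P}(\mathbf{x}^{\mathrm{f}}, t \,|\, \mathbf{x}_0, t_0)}{\partial t} = \sum_{j=1}^{M_{\mathrm{f}}} \Bigl[ a_j^{\mathrm{f}}(\mathbf{x}^{\mathrm{f}} - \boldsymbol{\nu}_j^{\mathrm{f}}, \mathbf{x}_0^{\mathrm{s}})\, \hat{P}(\mathbf{x}^{\mathrm{f}} - \boldsymbol{\nu}_j^{\mathrm{f}}, t \,|\, \mathbf{x}_0, t_0) - a_j^{\mathrm{f}}(\mathbf{x}^{\mathrm{f}}, \mathbf{x}_0^{\mathrm{s}})\, \hat{P}(\mathbf{x}^{\mathrm{f}}, t \,|\, \mathbf{x}_0, t_0) \Bigr]$$

wherein $\hat{P}(\mathbf{x}^{\mathrm{f}}, t \,|\, \mathbf{x}_0, t_0)$ is the probability that $\hat{\mathbf{X}}^{\mathrm{f}}(t) = \mathbf{x}^{\mathrm{f}}$, given that $\mathbf{X}(t_0) = \mathbf{x}_0$.
This master equation for $\hat{\mathbf{X}}^{\mathrm{f}}(t)$ will be simpler than the master equation for $\mathbf{X}(t)$
because it has fewer reactions and fewer species.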

Finally, in order to apply the ssSSA, we require that two conditions be satisfied.
The first condition is that the virtual fast process $\hat{X}^f(t)$ be stable, in the sense that
it approaches a well defined, time-independent random variable $\hat{X}^f(\infty)$ as $t \to \infty$;
thus, we require the limit

$$\lim_{t \to \infty} \hat{P}(x^f, t \mid x_0, t_0) \equiv \hat{P}(x^f, \infty \mid x_0)$$

to exist. $\hat{P}(x^f, \infty \mid x_0)$ can be calculated from the stationary form of the
time-dependent master equation,

$$0 = \sum_{j=1}^{M_f} \left[ a_j^f(x^f - \nu_j^f, x_0^s)\, \hat{P}(x^f - \nu_j^f, \infty \mid x_0) - a_j^f(x^f, x_0^s)\, \hat{P}(x^f, \infty \mid x_0) \right],$$

which will be easier to solve since it is purely algebraic. The second condition we
impose is that the relaxation of $\hat{X}^f(t)$ to its stationary asymptotic form $\hat{X}^f(\infty)$
happen very quickly on the time scale of the slow reactions. More precisely, we
require that the relaxation time of the virtual fast process be very much less than
the expected time to the next slow reaction.

These two conditions will generally be satisfied if the system is stiff. If satisfying 
them can be accomplished only by making some changes in the way we originally 
partitioned the reactions into fast and slow subsets, then we do that, regardless 
of propensity function values. But if these conditions cannot be satisfied, we must 
conclude that the ssSSA is not applicable. 

Given the foregoing definitions and conditions, it is possible to prove the slow-scale
approximation (Cao et al., 2005): if the system is in state $(x^f, x^s)$ at time $t$, and if
$\Delta_s$ is a time increment that is very large compared to the relaxation time of $\hat{X}^f(t)$
but very small compared to the expected time to the next slow reaction, then the
probability that one $R_j^s$ reaction will occur in the time interval $[t, t + \Delta_s)$ can be
well approximated by $\bar{a}_j^s(x^s; x^f)\,\Delta_s$, where

$$\bar{a}_j^s(x^s; x^f) \equiv \sum_{x^{f\prime}} \hat{P}(x^{f\prime}, \infty \mid x^f, x^s)\, a_j^s(x^{f\prime}, x^s) \qquad (16.28)$$

We call $\bar{a}_j^s(x^s; x^f)$ the slow-scale propensity function for reaction channel $R_j^s$ because
it serves as a propensity function for $R_j^s$ on the timescale of the slow reactions.
Mathematically, it is the average of the regular $R_j^s$ propensity function over the
fast variables, treated as though they were distributed according to the asymptotic
virtual fast process $\hat{X}^f(\infty)$.

The slow-scale SSA is an immediate consequence of this slow-scale approximation.
The idea is to move the system forward in time in the manner of the SSA one
slow reaction at a time, updating the fast variables after each step by randomly
sampling $\hat{X}^f(\infty)$ (Cao et al., 2005).

To illustrate how the ssSSA works, consider the simple reaction set

$$S_1 \underset{c_2}{\overset{c_1}{\rightleftharpoons}} S_2 \overset{c_3}{\longrightarrow} S_3 \qquad (16.29)$$

under the condition

$$c_2 \gg c_3 \qquad (16.30)$$

Here, an $S_2$ molecule is most likely to change into an $S_1$ molecule, a change that
is relatively unimportant since it will eventually be reversed. On rare occasions,
though, an $S_2$ molecule will instead change into an $S_3$ molecule, a potentially
more important change since it is irreversible. This simple model has been used to
help understand certain features of the heat shock response mechanism in E. coli
(ElSamad and Khammash, 2006). Roughly, $S_2$ can be thought of as the active
form of an enzyme which either gets deactivated via reaction $R_2$ (and subsequently
reactivated via reaction $R_1$), or gets bound to a DNA promoter site via reaction
$R_3$ to allow the transcription of an important gene. In the heat shock model, we
are particularly interested in the case in which the average number of $S_2$ molecules
is very small, even less than 1.

We shall take the fast reactions to be $R_1$ and $R_2$, and the slow reaction to be $R_3$.
Then the fast species will be $S_1$ and $S_2$, and the slow species $S_3$. The virtual fast
process $\hat{X}^f(t)$ will be the $S_1$ and $S_2$ populations undergoing only the fast reactions
$R_1$ and $R_2$. Unlike the real fast process, which gets affected whenever $R_3$ fires, the
virtual fast process obeys the conservation relation

$$\hat{X}_1(t) + \hat{X}_2(t) = x_T \quad \text{(constant)} \qquad (16.31)$$

This relation greatly simplifies the analysis of the virtual fast process, since it
reduces the problem to a single independent state variable.

Eliminating $\hat{X}_2(t)$ in favor of $\hat{X}_1(t)$ by means of equation 16.31, we see that given
$\hat{X}_1(t) = x_1'$, $\hat{X}_1(t + dt)$ will equal $x_1' - 1$ with probability $c_1 x_1'\,dt$, and $x_1' + 1$ with
probability $c_2 (x_T - x_1')\,dt$. $\hat{X}_1(t)$ is therefore what is known mathematically as a
“bounded birth-death” Markov process. It can be shown (Gillespie, 2002) that this
process has, for any initial value $x_1 \in [0, x_T]$, the asymptotic stationary distribution

$$\hat{P}(x_1', \infty \mid x_T) = \frac{x_T!}{x_1'!\,(x_T - x_1')!}\, q^{x_1'} (1 - q)^{x_T - x_1'}, \quad (x_1' = 0, 1, \ldots, x_T) \qquad (16.32)$$
where $q \equiv c_2/(c_1 + c_2)$. This tells us that $\hat{X}_1(\infty)$ is the binomial random variable
$B(q, x_T)$, whose mean and variance are given by

$$\langle \hat{X}_1(\infty) \rangle = x_T\, q = \frac{c_2\, x_T}{c_1 + c_2} \qquad (16.33a)$$

$$\mathrm{var}\{\hat{X}_1(\infty)\} = x_T\, q (1 - q) = \frac{c_1 c_2\, x_T}{(c_1 + c_2)^2} \qquad (16.33b)$$

It can also be shown (Cao et al., 2005) that $\hat{X}_1(t)$ relaxes to $\hat{X}_1(\infty)$ in a time of
order $(c_1 + c_2)^{-1}$.
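
The binomial form of equation 16.32 can be checked numerically. The following is a minimal sketch, not part of the original derivation: it plugs the binomial distribution into the stationary master equation of the bounded birth-death process (loss rate $c_1 x_1'$, gain rate $c_2 (x_T - x_1')$) and compares the moments with equations 16.33a and 16.33b. The function names and the small illustrative values $c_1 = 1$, $c_2 = 3$, $x_T = 20$ are assumptions chosen for the example.

```python
from math import comb


def binomial_pmf(k, n, q):
    """P(k) for the binomial random variable B(q, n)."""
    return comb(n, k) * q**k * (1.0 - q) ** (n - k)


def stationary_check(c1, c2, x_T):
    """Return (worst residual, mean, variance) for the candidate pmf of eq. 16.32.

    The residual is the largest imbalance between probability inflow and outflow
    in the stationary master equation of the virtual fast process.
    """
    q = c2 / (c1 + c2)
    p = [binomial_pmf(k, x_T, q) for k in range(x_T + 1)]
    worst = 0.0
    for k in range(x_T + 1):
        inflow = 0.0
        if k > 0:
            # arrive from k - 1 when R2 fires (S2 -> S1), rate c2 * (x_T - (k - 1))
            inflow += c2 * (x_T - (k - 1)) * p[k - 1]
        if k < x_T:
            # arrive from k + 1 when R1 fires (S1 -> S2), rate c1 * (k + 1)
            inflow += c1 * (k + 1) * p[k + 1]
        outflow = (c1 * k + c2 * (x_T - k)) * p[k]
        worst = max(worst, abs(inflow - outflow))
    mean = sum(k * pk for k, pk in enumerate(p))
    var = sum((k - mean) ** 2 * pk for k, pk in enumerate(p))
    return worst, mean, var


# Illustrative (assumed) values, small enough for exact integer binomial coefficients.
worst, mean, var = stationary_check(c1=1.0, c2=3.0, x_T=20)
print("largest stationary residual:", worst)
print("mean:", mean, " expected c2*x_T/(c1+c2) =", 3.0 * 20 / 4.0)
print("variance:", var, " expected c1*c2*x_T/(c1+c2)^2 =", 1.0 * 3.0 * 20 / 16.0)
```

The residual should be at the level of floating-point round-off, which is what one expects if the binomial distribution is indeed stationary for this process.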

The slow-scale propensity function for the slow reaction $R_3$ is, according to
equation 16.28, the average of $a_3(x) = c_3 x_2$ with respect to $\hat{X}^f(\infty)$. Therefore,
using equations 16.31 and 16.33a,

$$\bar{a}_3(x_3; x_1, x_2) = c_3 \langle \hat{X}_2(\infty) \rangle = \frac{c_3 c_1 (x_1 + x_2)}{c_1 + c_2} \qquad (16.34)$$
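
As a hedged illustration of how equation 16.28 leads to equation 16.34, the sketch below carries out the average explicitly: it sums $c_3 x_2'$ over the binomial stationary distribution of equation 16.32 and compares the result with the closed form. The function names are hypothetical, and the probabilities are evaluated through `math.lgamma` only to avoid overflow for populations in the thousands.

```python
from math import exp, lgamma, log


def log_binomial_pmf(k, n, q):
    """Logarithm of the B(q, n) probability of the value k (overflow-safe)."""
    return (lgamma(n + 1) - lgamma(k + 1) - lgamma(n - k + 1)
            + k * log(q) + (n - k) * log(1.0 - q))


def a3_bar_by_averaging(c1, c2, c3, x1, x2):
    """Equation 16.28 for R3: average c3 * x2' over the stationary fast process."""
    x_T = x1 + x2
    q = c2 / (c1 + c2)
    total = 0.0
    for x1_prime in range(x_T + 1):
        x2_prime = x_T - x1_prime              # conservation relation 16.31
        total += exp(log_binomial_pmf(x1_prime, x_T, q)) * c3 * x2_prime
    return total


def a3_bar_closed_form(c1, c2, c3, x1, x2):
    """Equation 16.34."""
    return c3 * c1 * (x1 + x2) / (c1 + c2)


# Rate constants and initial populations taken from the parameter values 16.36.
args = (10.0, 4.0e4, 2.0, 2000, 0)
print(a3_bar_by_averaging(*args), a3_bar_closed_form(*args))
```

The two numbers should agree to within round-off; the closed form is of course the one to use inside a simulation loop.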
Since the reciprocal of $\bar{a}_3(x_3; x_1, x_2)$ estimates the average time to the next $R_3$
reaction, the condition that the relaxation time of the virtual fast process be very
much smaller than the mean time to the next slow reaction is

$$c_1 + c_2 \gg \frac{c_3 c_1 (x_1 + x_2)}{c_1 + c_2} \qquad (16.35)$$





This condition will be satisfied if the inequality 16.30 is sufficiently strong. In that 
case, the slow-scale SSA for reactions 16.29 goes as follows: 


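1. Given $X(t_0) = (x_{10}, x_{20}, x_{30})$, set $t \leftarrow t_0$ and $x_i \leftarrow x_{i0}$ ($i = 1, 2, 3$).

2. In state $(x_1, x_2, x_3)$ at time $t$, compute $\bar{a}_3(x_3; x_1, x_2)$ from equation 16.34.

3. Draw a unit-interval uniform random number $r$, and compute

$$\tau = \frac{1}{\bar{a}_3(x_3; x_1, x_2)} \ln\left(\frac{1}{r}\right)$$

4. Advance to the next $R_3$ reaction by replacing $t \leftarrow t + \tau$ and

$$x_3 \leftarrow x_3 + 1, \quad x_2 \leftarrow x_2 - 1$$

With $x_T = x_1 + x_2$,

$$x_1 \leftarrow \text{sample of } B\left(\frac{c_2}{c_1 + c_2},\, x_T\right), \quad x_2 \leftarrow x_T - x_1$$

5. Record $(t, x_1, x_2, x_3)$ if desired. Then return to step 2 or else stop.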
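
The five steps above can be sketched in a few lines of Python. This is a minimal illustration for reactions 16.29 only, not the authors' implementation: the function names, the `t_max` and `seed` arguments, and the naive binomial sampler (a sum of Bernoulli trials, adequate for small populations but slow for large ones) are all assumptions made for the example.

```python
import math
import random


def sample_binomial(n, q, rng):
    """Draw one sample of B(q, n) by summing n Bernoulli trials (simple, not fast)."""
    return sum(1 for _ in range(n) if rng.random() < q)


def slow_scale_ssa(c1, c2, c3, x10, x20, x30, t0=0.0, t_max=math.inf, seed=None):
    """Slow-scale SSA for S1 <-> S2 -> S3; returns a list of (t, x1, x2, x3)."""
    rng = random.Random(seed)
    t, x1, x2, x3 = t0, x10, x20, x30                  # step 1
    q = c2 / (c1 + c2)
    history = [(t, x1, x2, x3)]
    while x1 + x2 > 0:                                 # R3 can fire only while S1 + S2 > 0
        a3_bar = c3 * c1 * (x1 + x2) / (c1 + c2)       # step 2, equation 16.34
        r = 1.0 - rng.random()                         # uniform on (0, 1], so 1/r is finite
        tau = math.log(1.0 / r) / a3_bar               # step 3
        if t + tau > t_max:
            break
        t += tau                                       # step 4: fire one R3 reaction
        x3 += 1
        x2 -= 1                                        # may be transiently negative; only x_T is used
        x_T = x1 + x2
        x1 = sample_binomial(x_T, q, rng)              # relax the fast variables
        x2 = x_T - x1
        history.append((t, x1, x2, x3))                # step 5
    return history


# Small assumed population so the naive sampler stays cheap; rate constants from 16.36.
trajectory = slow_scale_ssa(c1=10.0, c2=4.0e4, c3=2.0, x10=50, x20=0, x30=0, seed=1)
print(len(trajectory) - 1, "slow reactions simulated; final state", trajectory[-1][1:])
```

Because each pass through the loop simulates exactly one slow reaction, the cost scales with the number of $R_3$ events rather than with the far larger number of $R_1$ and $R_2$ events.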


In step 4, the $x_3$ update and the first $x_2$ update actualize the $R_3$ reaction. The
remaining updates in step 4 (the procedure beginning with $x_T = x_1 + x_2$) then “relax”
the fast variables in a manner consistent with the
stationary distribution in equation 16.32 and the new value of $x_T$. See Press et al.
(1986) for a way to generate samples of the binomial random variable $B(q, x_T)$.


Figure 16.4 Two simulations of reactions 16.29 using the parameter values 16.36. Graph
(a) shows an exact SSA run in which the populations are plotted essentially after each $R_3$
reaction (see text for details). Over 23 million reactions make up this run, the overwhelming
majority of which are $R_1$ and $R_2$ reactions. Graph (b) shows an approximate ssSSA run in
which only $R_3$ reactions, which totaled 587, were directly simulated, and the populations
are plotted after each of those. The ssSSA simulation ran over 1,000 times faster than the
SSA simulation.

Figure 16.4a shows the results of an exact SSA run of reactions 16.29 for the
parameter values

$$c_1 = 10, \quad c_2 = 4 \times 10^4, \quad c_3 = 2; \quad x_{10} = 2000, \quad x_{20} = x_{30} = 0 \qquad (16.36)$$

The $S_1$ and $S_3$ populations here are plotted out immediately after each $R_3$ reaction.
The $S_2$ population, which is shown on a separate scale, is plotted out at a like
number of equally spaced time intervals; this gives a more typical picture of the $S_2$
population than plotting it immediately after each $R_3$ reaction because $R_3$ reactions
are more likely to occur when the $S_2$ population is larger.

For the parameter values 16.36, condition 16.35 is satisfied by 4 orders of
magnitude initially, and even more so as the total population of $S_1$ and $S_2$ declines;
therefore, this reaction set should be amenable to simulation using the slow-scale
SSA. Figure 16.4b shows the results of such a simulation, plotted after each $R_3$
reaction. We note that all the species trajectories in this approximate ssSSA run
agree very well with those in the exact SSA run of figure 16.4a; even the behavior
of the sparsely populated species $S_2$ is accurately replicated by the ssSSA. But
whereas the SSA run in figure 16.4a had to simulate over 23 million reactions, the
slow-scale SSA run in figure 16.4b simulated only 587 reactions, with commensurate
differences in their computation times.
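
The figures quoted above can be reproduced with a short back-of-envelope calculation. The sketch below is an estimate under a stated assumption, namely that between slow reactions the fast species sit at the stationary means of equation 16.33a; the variable names are illustrative. It evaluates the two sides of condition 16.35 for the parameter values 16.36 and estimates how many fast reactions an exact SSA run must simulate for each $R_3$ event.

```python
# Parameter values 16.36
c1, c2, c3 = 10.0, 4.0e4, 2.0
x_T = 2000                      # initial total population of S1 and S2

# Condition 16.35: relaxation rate of the fast process vs. slow-scale propensity.
relaxation_rate = c1 + c2
a3_bar = c3 * c1 * x_T / (c1 + c2)             # equation 16.34
separation = relaxation_rate / a3_bar          # should be of order 10^4

# Mean firing rates of R1 and R2 at stationarity: c1*<X1> and c2*<X2>.
mean_x1 = c2 * x_T / (c1 + c2)
mean_x2 = c1 * x_T / (c1 + c2)
fast_rate = c1 * mean_x1 + c2 * mean_x2
fast_per_slow = fast_rate / a3_bar             # algebraically 2*c2/c3, independent of x_T

n_slow = 587                                   # number of R3 reactions reported for figure 16.4b
estimated_total = n_slow * (fast_per_slow + 1.0)

print(f"time-scale separation: {separation:.3g}")
print(f"fast reactions per slow reaction: {fast_per_slow:.3g}")
print(f"estimated reactions in the exact SSA run: {estimated_total:.3g}")
```

Under this assumption the ratio of fast to slow firings is $2 c_2 / c_3 = 4 \times 10^4$, so 587 slow reactions correspond to roughly 23.5 million reactions in total, in line with the "over 23 million" reported for the exact run.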





16.9 Concluding Remarks 


In this chapter we have discussed two broad themes. The first is the “logical bridge” 
that connects the chemical master equation (CME) and stochastic simulation 
algorithm (SSA) on one side with the reaction rate equation (RRE) on the other 
side. Under the well-stirred (spatially homogeneous) assumption, the CME/SSA 
provides a mathematical description that is exact, discrete, and stochastic. If the 
system is such that the leap condition can be satisfied, the CME/SSA can be 
approximated by the Poissonian tau-leaping formula (equation 16.6) to obtain a 
description that is approximate, discrete, and stochastic. Further, if the reactant 
populations are large enough that the Poissonian tau-leaping formula can be 
approximated by the Gaussian tau-leaping formula (equation 16.8), which in turn 
is equivalent to the chemical Langevin equation (CLE) (equation 16.9), we obtain 
a description that is approximate, continuous, and stochastic. And finally, in the 
thermodynamic limit of an infinitely large system, the random terms in the CLE 
usually become negligibly small compared to the deterministic terms, and the CLE 
reduces to the RRE, which is approximate, continuous, and deterministic. 

This progression—from the CME and SSA to tau-leaping to the CLE to the 
RRE—in which each successive level is an approximation of the preceding level, 
would, along with the corresponding numerical methods at each level, give us all the 
tools we need to efficiently simulate spatially homogeneous systems were it not for 
the multiscale nature of most biochemical systems: Both the species populations and 
the rates of the various chemical reactions typically span many orders of magnitude. 
As a consequence, in most cases the system as a whole does not fit efficiently into one 
level of description exclusive of the others. The second theme of our development in 
this chapter has been to describe two strategies for coping with multiscale problems: 
implicit tau-leaping, and the slow-scale SSA. But much more remains to be done
on the problem of multiscale.


Until recently, the majority of computational models in biology were implemented
in custom programs and published as statements of the underlying mathematics. 
However, to be useful as formal embodiments of our understanding of biological 
systems, computational models must be put into a consistent form that can be 
communicated more directly between the software tools used to work with them. 
In this chapter, we describe the Systems Biology Markup Language (SBML), a 
format for representing models in a way that can be used by different software 
systems to communicate and exchange those models. By supporting SBML as an 
input and output format, different software tools can all operate on an identical 
representation of a model, removing opportunities for errors in translation and 
assuring a common starting point for analyses and simulations. We also take this 
opportunity to discuss some of the resources available for working with SBML as 
well as ongoing efforts in SBML’s continuing evolution. 





17.1 Introduction 


The chapters of this book testify to the rising importance of computational modeling 
in biological research as a means of helping to better understand biological function. 
The increasing interest in this approach, coupled with our modern ability to 
generate ever-more complex models more rapidly than ever before, makes it clear
that practical computational modeling requires the use of software tools. 

Until recently, the majority of models were implemented in custom programs 
and published only as statements of the underlying mathematics (that is, intended 
for human consumption). However, to be useful as formal embodiments of our
understanding of biological systems (Bower and Bolouri, 2001), computational 
models must be put into a consistent form that can be communicated more directly 
between the software tools used to work with them. This format must help overcome 
a number of problems facing a systems biologist: 


■ Users often need to work with complementary resources from multiple software
tools in the course of a project because different tools have different strengths and
capabilities. For example, one tool may have a good model editing interface, another
tool may provide novel facilities for analyzing system properties, yet another may
implement an advanced simulation capability but lack a good graphical interface,
etcetera. If the tools do not share a common model storage format, users are forced
to re-encode their models in each tool separately, a time-consuming and error-prone
practice.

■ Models published in peer-reviewed journals are sometimes accompanied by
instructions for obtaining the definitions in electronic form. However, because each
author may use a different software environment (and associated model representation
language), such definitions are often not straightforward to examine, test, and
reuse. Researchers who wish to use a published model typically must transcribe it
manually into a format compatible with their particular software.

■ When simulation software packages are no longer supported, models developed
in those systems can become stranded and unusable. This has already happened
on a number of occasions, with a resulting loss of usable models to the research
community. Continued innovation and development of new tools will only aggravate
this problem unless the issue of standard formats is addressed.

■ Reuse of existing models requires that those models can be clearly identified, easily
retrieved, and related to their published descriptions in the scientific literature.
Moreover, because of the increasing size and complexity of models continually being
developed, the model structure should be documented to allow for efficient handling
and sound modification.


We developed the Systems Biology Markup Language (SBML) in an effort to 
address these problems. SBML is a format for representing computational models 
in a way that can be used by different software systems to communicate and 
exchange those models (Finney and Hucka, 2003; Hucka et al., 2003, 2004). By 
supporting SBML as an input and output format, different software tools can all 
operate on an identical representation of a model, removing opportunities for errors 
in translation and assuring a common starting point for analyses and simulations. 
SBML is by no means a perfect format, but it has proven useful and achieved 
widespread acceptance within the domain of modeling at the level of biochemical 
reaction networks. Over 90 open-source and commercial software tools support 
SBML as of November 2005. 

A gratifying by-product of the SBML project has been the way it has catalyzed a 
community of interested researchers, developers, and users who are now collaborating
on evolving SBML and creating new resources around it. This is undoubtedly
a reflection of an urgent need in the community for any format such as SBML to 
address issues of interoperability. At the same time, we suspect that the challenges 
faced by the SBML community and the solutions that are arising have underlying 
components that would be faced by any effort to define a similar standard exchange 
format. We discuss two examples in this chapter. One is the difficulty of balancing 
ease of language implementation against representational power. Today this is being
answered by progress towards SBML Level 2 Version 2, which is expected to
be ratified in 2005, and the modular SBML Level 3, which is expected in 2006. A 
second is the unexpected difficulty of ensuring correct interpretation of SBML by 
different software applications. We describe our current attempts to address this 
problem using a combination of (i) a carefully-designed software library, libSBML, 
which among other features provides rule-based model consistency testing, and (ii) 
a semantic validation suite for testing correct interpretation of SBML constructs 
by software applications. 





17.2 Software Assistance for Biological Modeling 


As an example of how software technologies such as SBML assist modelers today, 
consider the following hypothetical (but still quite plausible) sequence of events. 

A computationally-savvy biologist named Albert is investigating one of the 
mitogen-activated protein kinase (MAPK) cascades. The MAPK pathways lead 
from growth factor receptors on cell membranes to effector molecules located in 
the cell cytoplasm and nucleus. This family of signaling pathways is one that 
has received much attention in both experimental (Seger and Krebs, 1995) and 
computational biology (Schoeberl et al., 2002). 

Our hypothetical biologist might begin with a body of experimental data 
gathered by himself and other members of the laboratory in which he works. 
In order to understand his experiments in the context of other data and other 
published results, he decides to develop a computational model so that he can 
integrate different sources of existing knowledge and his own hypotheses into a 
common, formalized framework. Since the MAPK system is a popular topic of 
study, he has no trouble finding related work in the literature, including existing 
computational models. He chooses to begin with a relatively simple model by 
Kholodenko (2000). The original publication gives a complete listing
of the mathematical equations that define Kholodenko’s model, but no software 
implementation. (Even though that particular article is from this decade, it still 
predates the development of SBML and most of today’s software tools.) The model 
is not complex, but he knows that recreating a model from a research paper will 
take time, so before starting, he visits the BioModels Database (BioModels Team, 
2005) to check if the model is available in a machine-readable format. He searches 
the database and quickly finds an existing implementation (figure 17.1), which he 
can download in SBML format. 


Figure 17.1 Screenshot of a model display page in the BioModels Database (BioModels
Team, 2005).


Once he has the SBML file, Albert starts up his favorite Windows-based model 
editing package, JDesigner (Sauro et al., 2003). This package provides a friendly, 
graphical diagram view of a model (figure 17.2). He spends a significant amount of 
time experimenting with the model, running time-course simulations to examine the
behavior under different conditions, as well as making modifications and exploring 
the results. After becoming familiar with the Kholodenko model, he next begins to 
make modifications based on his own experimental work and that of his colleagues. 

Eventually, Albert’s model grows and becomes substantially different from the 
original. He reaches a point where he has to find values for parameters in the 
model that are not directly measurable, but he believes he has enough converging 
evidence from other data that he can search for plausible values by a process 
known as parameter estimation. This is a resource-intensive task requiring many 
repeated simulation runs and analyses—more than he can comfortably run on his 
laptop computer. Albert enlists the aid of a colleague, Bernadette, who works 
at another institution and who has access to clusters of computers on which 
she can quickly perform large computations. Bernadette is less a biologist and 
more a computational scientist, but she has had enough exposure to biological 
modeling that she can perform the parameter estimation tasks for Albert. Despite 
the geographical distance separating them and the fact that Bernadette is adamant 
about using Linux rather than Windows as her computer operating system of choice, 
Albert has no difficulty conveying an unambiguous model definition to Bernadette
because JDesigner can produce SBML output and Bernadette has at her disposal
several software tools that can read SBML.


Figure 17.2 Screenshot of JDesigner (Sauro et al., 2003), a free computational modeling
system for biochemical reaction networks. It runs on the Microsoft Windows operating
system.

Bernadette writes command scripts in Linux that take Albert’s model and his 
experimental data (which he stored in ordinary comma-delimited tabular format) 
and perform parameter estimation using an optimization package written in MATLAB
(The Mathworks, Inc., 2005). To convert the SBML model into appropriate
MATLAB data structures, she uses one of the free MATLAB toolboxes available
for this purpose (Keating, 2005). After some iterations back and forth with Albert
to clarify his goals, and many computer runs, the pair eventually determine
best-estimate values for the unknown parameters in Albert’s model. Bernadette 
also performs a large number of additional simulation and analysis runs on her 
Linux computers using COPASI (Mendes, 2003) to explore the behaviors of the 
model. The results enable Albert to continue further with his research, comparing 
his predictions to experimental data and refining his model to incorporate new
hypotheses. The model and its results are novel enough that Albert writes an article
about them with Bernadette. They also submit the SBML model to the BioModels 
Database, where the curators annotate the model and enter it into the database for 
other researchers to use and build upon. 

Some time after the article is published, a researcher working at a pharmaceutical
company reads Albert and Bernadette’s paper on MAPK signaling. It turns out
that this researcher, Carl, has been investigating novel therapeutic interventions
on this same pathway. Thanks to the availability of the model in SBML form,
Carl is able to quickly obtain and try out the model in the software tool of his
choice (figure 17.3), a full-featured commercial package called TERANODE Design
Suite (TERANODE, Inc., 2005). The model’s structure and behavior are consistent
with his own findings, and moreover, it provides new insights that could lead to
an investigation of new pharmacological agents. Carl is interested in pursuing this
further. The copyright on the model stipulates that commercial users must contact
the authors, so he contacts Albert and Bernadette and begins a promising new
collaboration.


Figure 17.3 Screenshot of TERANODE Design Suite (TERANODE, Inc., 2005), an
example of a modern commercial software package using SBML and integrating model
editing, analysis, and simulation.





17.3 The SBML Representation of Models 


The SBML project is not an attempt to define a universal language for representing 
quantitative models; the rapidly evolving views of biological function, coupled 
with the vigorous rates at which new computational techniques and individual 
tools are being developed today, are incompatible with a one-size-fits-all idea of a 
universal language. A more realistic alternative is to acknowledge the diversity of 
approaches and methods being explored by different software tool developers, and 
seek a common intermediate format—a lingua franca—enabling communication of 
the most essential aspects of the models. 


17.3.1 Brief Summary of the Form and Features of SBML


SBML is a machine-readable model definition language defined neutrally with
respect to programming languages and software encoding. It is defined using a subset
of UML, the Unified Modeling Language (Eriksson and Penker, 1998; Oestereich,
1999), and in turn, this definition is used to create an XML Schema (Biron and
Malhotra, 2000; Fallside, 2000; Thompson et al., 2000) for SBML. The XML Schema
specifies how SBML can be expressed using XML, the eXtensible Markup
Language (Bosak and Bray, 1999; Bray et al., 2000). XML is a simple and portable
text-based substrate that has been gaining widespread acceptance in computational
biology and bioinformatics (Achard et al., 2001; Augen, 2001).

The main focus of SBML is encoding models consisting of biochemical entities 
(species) linked by reactions to form biochemical networks. An important principle 
in SBML is that models are decomposed into explicitly-labeled constituent elements, 
the set of which resembles a verbose rendition of chemical reaction equations. 
The representation deliberately does not cast the model directly into a set of 
differential equations or other specific mathematical frameworks. This explicit, 
modeling-framework-agnostic decomposition makes it easier for different software 
tools to interpret the model and translate the SBML form into whatever internal 
form each tool actually uses. 

SBML is being developed in levels, with each higher SBML level adding richness 
to the model definitions that can be represented by the language. Level 2 is the 
highest level of SBML currently defined; it represents an incremental evolution of 
the language resulting from the practical experiences of many users and developers 
working with Level 1 since its introduction in the year 2001. A definition of a model 
in SBML Level 2 consists of lists of one or more of the following components: 


■ compartment: a container of finite dimensions where one or more chemical
substances (well-mixed) are located;

■ species: a pool of a chemical substance located in a specific compartment (a species
represents the concentration or amount of a substance and not a single molecule);

■ reaction: a statement describing some transformation, transport or binding
process that can change one or more species (each reaction is characterized by the
stoichiometry of its products and reactants and optionally by a rate equation);

■ parameter: a quantity that has a symbolic name;

■ unit definition: a name for a unit used in the expression of quantities in a model;

■ rule: a mathematical expression that is added to the model equations constructed
from the set of reactions (rules can be used to set parameter values, establish
constraints between quantities, etcetera);

■ function: a named mathematical function that can be used in place of repeated
expressions in rate equations and other formulas; and

■ event: a set of mathematical formulas evaluated at a specified moment in the time
evolution of the system.


<?xml version="1.0" encoding="UTF-8"?>
<sbml xmlns="http://www.sbml.org/sbml/level2" level="2" version="1">
  <model id="EnzymeKinetics">
    <listOfCompartments>
      <compartment id="Cell" size="1"/>
    </listOfCompartments>
    <listOfSpecies>
      <species id="S" compartment="Cell" initialAmount="1" boundaryCondition="true"/>
      <species id="E" compartment="Cell" initialAmount="1"/>
      <species id="ES" compartment="Cell" initialAmount="0.01"/>
      <species id="P" compartment="Cell" initialAmount="0.01" boundaryCondition="true"/>
    </listOfSpecies>
    <listOfReactions>
      <!-- Reversible binding: E + S <-> ES -->
      <reaction id="Reaction1">
        <listOfReactants>
          <speciesReference species="S"/>
          <speciesReference species="E"/>
        </listOfReactants>
        <listOfProducts>
          <speciesReference species="ES"/>
        </listOfProducts>
        <kineticLaw>
          <math xmlns="http://www.w3.org/1998/Math/MathML">
            <apply> <minus/>
              <apply> <times/> <ci> k_1 </ci> <ci> S </ci> <ci> E </ci> </apply>
              <apply> <times/> <ci> k_r </ci> <ci> ES </ci> </apply>
            </apply>
          </math>
          <listOfParameters>
            <parameter id="k_1" value="3"/>
            <parameter id="k_r" value="6"/>
          </listOfParameters>
        </kineticLaw>
      </reaction>
      <!-- Irreversible catalytic step: ES -> E + P -->
      <reaction id="Reaction2" reversible="false">
        <listOfReactants>
          <speciesReference species="ES"/>
        </listOfReactants>
        <listOfProducts>
          <speciesReference species="E"/>
          <speciesReference species="P"/>
        </listOfProducts>
        <kineticLaw>
          <math xmlns="http://www.w3.org/1998/Math/MathML">
            <apply> <times/> <ci> k_2 </ci> <ci> ES </ci> </apply>
          </math>
          <listOfParameters>
            <parameter id="k_2" value="9"/>
          </listOfParameters>
        </kineticLaw>
      </reaction>
    </listOfReactions>
  </model>
</sbml>


Figure 17.4 Simple SBML Level 2 model of a system of reactions involving enzyme 
kinetics. 
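
Because SBML is ordinary XML, the structure of such a listing can be inspected with any XML parser. The sketch below is an illustration only, using Python's standard `xml.etree.ElementTree` module rather than a dedicated SBML library; the function name `summarize_model` is hypothetical, and the embedded model is an abridged copy of figure 17.4 with the kinetic laws and the XML declaration omitted for brevity. It also relies on the SBML Level 2 convention that a reaction is reversible unless it is marked otherwise.

```python
import xml.etree.ElementTree as ET

SBML_NS = {"sbml": "http://www.sbml.org/sbml/level2"}

# Abridged copy of the figure 17.4 model (kinetic laws omitted).
SBML_TEXT = """
<sbml xmlns="http://www.sbml.org/sbml/level2" level="2" version="1">
  <model id="EnzymeKinetics">
    <listOfCompartments>
      <compartment id="Cell" size="1"/>
    </listOfCompartments>
    <listOfSpecies>
      <species id="S" compartment="Cell" initialAmount="1" boundaryCondition="true"/>
      <species id="E" compartment="Cell" initialAmount="1"/>
      <species id="ES" compartment="Cell" initialAmount="0.01"/>
      <species id="P" compartment="Cell" initialAmount="0.01" boundaryCondition="true"/>
    </listOfSpecies>
    <listOfReactions>
      <reaction id="Reaction1">
        <listOfReactants>
          <speciesReference species="S"/>
          <speciesReference species="E"/>
        </listOfReactants>
        <listOfProducts>
          <speciesReference species="ES"/>
        </listOfProducts>
      </reaction>
      <reaction id="Reaction2" reversible="false">
        <listOfReactants>
          <speciesReference species="ES"/>
        </listOfReactants>
        <listOfProducts>
          <speciesReference species="E"/>
          <speciesReference species="P"/>
        </listOfProducts>
      </reaction>
    </listOfReactions>
  </model>
</sbml>
"""


def summarize_model(sbml_text):
    """Return (model id, species ids, [(reaction id, reactants, products, reversible)])."""
    root = ET.fromstring(sbml_text)
    model = root.find("sbml:model", SBML_NS)
    species = [s.get("id")
               for s in model.findall("sbml:listOfSpecies/sbml:species", SBML_NS)]
    reactions = []
    for rxn in model.findall("sbml:listOfReactions/sbml:reaction", SBML_NS):
        reactants = [r.get("species") for r in
                     rxn.findall("sbml:listOfReactants/sbml:speciesReference", SBML_NS)]
        products = [p.get("species") for p in
                    rxn.findall("sbml:listOfProducts/sbml:speciesReference", SBML_NS)]
        # In SBML Level 2 the 'reversible' attribute defaults to true when absent.
        reversible = rxn.get("reversible", "true") == "true"
        reactions.append((rxn.get("id"), reactants, products, reversible))
    return model.get("id"), species, reactions


model_id, species_ids, reaction_list = summarize_model(SBML_TEXT)
print(model_id, species_ids)
for rid, reactants, products, reversible in reaction_list:
    arrow = "<->" if reversible else "->"
    print(rid, ":", " + ".join(reactants), arrow, " + ".join(products))
```

A generic parser of this kind performs no SBML-specific consistency checking; that is the role of purpose-built libraries such as libSBML.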


Additional features in SBML Level 2 include support for a systematic way of 
including metadata, and support for delay functions. The latter are useful for 
representing biological processes having a delayed response, but where the details 
of the processes and the actual delay mechanism are not relevant to the operation 
of the model. 

To make this discussion concrete, figure 17.4 gives the complete SBML Level 2
listing of a simple model of enzyme kinetics, E + S ⇌ ES → P, where E, S,
and P represent the enzyme, substrate, and product species, respectively, and ES
is an intermediate complex formed during the reaction. In this particular SBML
rendition, the system is represented as two reaction structures: the reversible
reaction E + S ⇌ ES, here defined with a forward reaction rate of k_1 * [S] * [E] and a
reverse reaction rate of k_r * [ES], and the irreversible reaction ES → E + P, here defined
with a forward reaction rate of k_2 * [ES]. The symbols E, S, and ES, when used in
rate expressions (SBML’s kineticLaw elements), stand for the concentrations of
the different species, and the parameters k_1, k_r, and k_2 are set to values k_1 = 3,
k_r = 6, and k_2 = 9. When specific units are omitted from quantities in an SBML
model (as they are here), the model is assumed to use the default units for those
quantities, which in SBML are moles for substance amounts and liters for volumes.
Other formulations of this model might, for example, express this system explicitly
as three irreversible reactions, change the units on quantities to be millimoles and
microliters, and so on. This model is presented here only to give a sense of the
structure of SBML and its relative simplicity, and we reiterate that people are not
meant to edit models directly at this level; instead, software tools read and write
this kind of representation on the user’s behalf.
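
To show how a tool might turn this representation into something it can simulate, the following sketch interprets the two rate laws as deterministic rate equations. This is one possible interpretation, chosen here for illustration; SBML itself does not prescribe a mathematical framework. The function name, the forward Euler scheme, and the step size are assumptions, and the species S and P are held constant because the listing marks them as boundary conditions. With a compartment size of 1, amounts and concentrations coincide numerically.

```python
def simulate_enzyme_kinetics(t_end=5.0, dt=1.0e-4):
    """Integrate the figure 17.4 model with forward Euler; returns final (E, ES)."""
    k_1, k_r, k_2 = 3.0, 6.0, 9.0           # parameter values from the listing
    S, E, ES = 1.0, 1.0, 0.01               # initial amounts; S is a boundary species
    steps = int(round(t_end / dt))
    for _ in range(steps):
        v1 = k_1 * S * E - k_r * ES         # net rate of Reaction1 (E + S <-> ES)
        v2 = k_2 * ES                       # rate of Reaction2 (ES -> E + P)
        dE = -v1 + v2                       # E is consumed by Reaction1, released by Reaction2
        dES = v1 - v2                       # ES is produced by Reaction1, consumed by Reaction2
        E += dt * dE
        ES += dt * dES
        # S and P are boundary species, so their values are not updated.
    return E, ES


E_final, ES_final = simulate_enzyme_kinetics()
# Analytical steady state: k_1*S*E = (k_r + k_2)*ES with E + ES conserved at 1.01.
print("E  =", E_final, " expected", 5.0 * 1.01 / 6.0)
print("ES =", ES_final, " expected", 1.01 / 6.0)
```

Because S is held fixed, the enzyme is simply redistributed between its free and bound forms, and the simulation settles to the steady state given by balancing the two reaction rates.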

SBML’s representational power extends far beyond the kind of simple enzyme 
kinetics model used here as an illustration. Its simple formalisms allow a wide 
range of biological phenomena to be modeled, including cell signaling, metabolism, 
gene regulation, and more. There is no assumption about the kinds of kinetics or 
interactions or network organizations that can be represented. Significant flexibility 
and power come from the ability to define arbitrary formulas for the rates of change 
of variables as well as the ability to express other constraints mathematically. 


17.3.2 Relationships to Other Efforts 


Many XML-based formats have been proposed for representing data and models 
in biology; however, we know of only two XML-based formats that are suitable for 
representing compartmental reaction network models with sufficient mathematical 
depth that the descriptions can be used as direct input to simulation software. The 
two are SBML and CellML (Hedley et al., 2001b,a; Lloyd et al., 2004). 

CellML is built around an approach of composing systems of equations by linking 
together the variables in those equations; this is augmented by features for declaring 
biochemical reactions explicitly, as well as encapsulating arbitrary components into 
modules. Its focus is on a component-based architecture to facilitate reuse of models 
and parts of models, and the mathematical description of models. By contrast, 
SBML provides constructs that are more similar to the internal data objects 
used in many contemporary simulation/analysis software packages specialized for 
biochemical networks. 

These differences notwithstanding, the SBML and CellML efforts share much in 
common and represent somewhat different approaches to solving the same general 
problems. They were initially developed independently, but the primary developers 
of both languages are actively engaged in exchanges of ideas and are seeking ways 
of making the languages more interoperable. SBML Level 2 borrows a number of 
approaches from CellML, making it that much easier to translate between the two 
formats. 


17.4 The Continued Evolution of SBML 


The need for a language like SBML was manifest during the first Workshop 
on Software Platforms for Systems Biology, held at the California Institute of 
Technology in early 2000. The two or three dozen attendees at the time represented 
fewer than a dozen software projects, yet even within this small group, it proved 
impossible to share models without having to re-encode them anew in each software 
tool. This needless impediment to collaboration directly inspired the SBML effort. 

Defining a language such as SBML and encouraging its use by other groups has 
always involved balancing conflicting demands. For example, there is pressure to 
include a wide variety of features to support the various kinds of modeling and 
analysis capabilities being explored in different tools. But if the capabilities are 
too advanced or too specialized for most tools, then few if any software packages 
will implement support for the entire language specification, with the consequence 
that most tools will still not be able to exchange models in a meaningful way. 
On the other hand, if SBML does not expand quickly enough to support features 
satisfying more advanced research efforts, then SBML risks losing those groups’ 
patience, potentially leading to the creation of incompatible dialects of the language. 

In an attempt to help achieve this balance, we are proceeding with a staged 
approach to SBML development, embodied in the already-mentioned concept of 
SBML levels. Each higher SBML level adds richness to the model definitions that 
can be represented by the language. By delimiting sets of features at incremen- 
tal stages, the SBML development process provides software authors with stable 
standards, and the community can gain experience with the language definitions 
before new features are introduced. Two levels have been defined so far (Finney 
et al., 2002; Hucka et al., 2001). Level 1 is simpler (but less powerful) than Level 2. 
The separate levels are intended to coexist; SBML Level 2 does not render Level 1 
obsolete. Software tools that cannot support higher levels can go on using lower 
levels; tools that can read higher levels are assured of also being able to interpret 
models defined in the lower levels. The open-source software infrastructure we have 
been developing around SBML (see Section 17.5) allows developers to support both 
Levels 1 and 2 in their software with a minimum amount of effort. 


17.4.1 Community Involvement 


One component of SBML’s success has been the community-oriented method of 
its continued evolution. SBML’s popularity has led to the formation of an active 
international group of researchers and software developers who are now working 
together to push SBML in new directions. As is the case with many projects 
today, the primary mode of interaction between members is electronic mail, with 
discussions taking place on the community mailing list, sbml-discuss@caltech.edu. 
The list currently contains over 200 members coming from academic, commercial 
and private environments, from all continents. Besides discussions over the list, 
another important mode of interaction has been regular face-to-face meetings during 
the Workshops on Software Platforms for Systems Biology (also known informally 
as the SBML Forum meeting), held since mid-2000. 

These meetings serve many vital functions. First, they provide a forum where 
proposals for potential new SBML features can be presented and where consensus 
decisions can be made about the development of SBML, with the aim of enabling 
SBML to support a wider range of model paradigms and modes of interoperability. 
Second, they ensure that systems biology software interoperability is maximized by 
discussing the correct use of SBML and (related to this) exposing software devel- 
opers to issues in the correct interpretation and handling of SBML in all software. 
Third, they inform developers of the latest developments in software infrastruc- 
ture for SBML. And finally, they educate the systems biology community about 
the range of modeling paradigms that are being used to understand biological phe- 
nomena. The ninth SBML Forum meeting was held on October 14-15, 2004, in 
Heidelberg, Germany, and was attended by 49 representatives of different interna- 
tional research groups. All presentation materials from SBML meetings are made 
publicly available on the project Web site (SBML Team, 2005b). 

In 2003, a new type of meeting was instituted: SBML Hackathons, in which 
software developers gather together to work simultaneously on their software next 
to other developers, discovering and resolving interoperability problems as they go. 
The third SBML Hackathon was held on May 9-10, 2005, at the National Museum 
of Emerging Science and Innovation in Tokyo, Japan, and was attended by 45 
delegates, nearly three times as many as attended the first SBML Hackathon in 
2003. 


17.4.2 SBML Level 2 Version 2 


As a practical consequence of how SBML develops and evolves, it reflects how the- 
oreticians and software developers conceptualize and structure their computational 
models of biochemical reaction networks. The exact form of the language matters 
less than the representational elements comprising the language. Though the incre- 
mental development path taken for SBML has led to a less-than-elegant structure, 
it is fair to say that SBML represents a consensus view of how computational 
models of reaction networks are understood today. The dedicated community of 
interested researchers has sustained the evolution of SBML and continues to produce 
improvements that meet increasingly sophisticated needs. 

The next specification of SBML is expected to be an incremental update, Level 2 
Version 2, to be followed closely by SBML Level 3, which has been in development 
for over a year. The following are illustrative of the enhancements likely to be 
introduced in SBML Level 2 Version 2 and the reasons for them. 


■ Species Type. In SBML, the amount (concentration or molecular count) of every 
chemical species must be defined with respect to a location. Locations in SBML 
are represented as compartments, where a compartment can represent a physical 
structure such as “cytoplasm” or a purely theoretical location used solely for 
modeling expediency. If the same kind of species appears in more than one location 
(for example, both inside a cell’s cytoplasm and outside the cell), this must be 
represented as two different species, each having separate identifiers in the model. 
The reason is that when SBML models are translated into typical computational 
forms, those species are represented as variables (again, either concentrations or 
molecular counts) whose values can change over time. Species located in different 
compartments are assumed to comprise different pools of the species—that is the 
logical point of having compartments in the first place. However, a number of 
software developers have expressed the need for specifying that two species variables 
in SBML refer to the same kind or type of chemical irrespective of compartmental 
location. Therefore, one of the changes planned for SBML Level 2 Version 2 is 
the introduction of a SpeciesType data structure for this purpose. This will make 
it possible for a model to define a list of SpeciesType structures. Each species 
definition will then be able to refer to a particular SpeciesType definition, stating, 
in effect, that it is “of this species type.” For example, a model could contain a 
SpeciesType for aspartate, and could have multiple species definitions, one for 
aspartate located in the cytosol and others for mitochondrial matrix compartments. 
The species representing the different pools of aspartate would have different 
identifiers (for instance, “aspartate_cytosol” and “aspartate_mitochon”), but each 
would refer to the common aspartate SpeciesType. 
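
As a rough illustration of the relationship described in this item, the following C++ sketch models several species pools that share one species type. It is not the SBML schema or the libSBML API: the struct names, their fields, and the helper poolsOfType are hypothetical and chosen only for this example.

```cpp
#include <string>
#include <vector>

// Hypothetical in-memory mirror of the planned structures (not libSBML's API).
struct SpeciesType {
    std::string id;            // e.g. "aspartate"
};

struct Species {
    std::string id;            // unique per pool, e.g. "aspartate_cytosol"
    std::string compartment;   // location of this pool
    std::string speciesType;   // id of the SpeciesType this pool belongs to
};

// Collect the ids of all species (pools) that refer to the given species type.
std::vector<std::string> poolsOfType(const std::vector<Species>& species,
                                     const std::string& typeId) {
    std::vector<std::string> ids;
    for (const Species& s : species) {
        if (s.speciesType == typeId) ids.push_back(s.id);
    }
    return ids;
}
```

With pools “aspartate_cytosol” and “aspartate_mitochon” both referring to the type “aspartate”, a tool could ask for every pool of aspartate without relying on naming conventions in the identifiers.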


■ Nested Unit Definitions. Not all software tools provide a means of changing the 
units of measurement used for the numerical quantities in a user’s model; they 
often assume specific units for different quantities and rely on users to adjust 
numerical values as necessary when encoding models in the software environment. 
Unfortunately, sometimes different tools make different unit assumptions; thus, 
some capability for redefining units in SBML is necessary in addition to specifying 
default units. We thought that a small, simple scheme would serve best, and this is 
what we attempted in the first definition of SBML (Level 1 Version 1). The scheme 
turned out to be too limited; for example, it did not allow for the definition of 
some types of units that are not in the SI unit system, and it was significantly 
less capable than the unit scheme in CellML, making it difficult to translate some 
models between CellML and SBML. The consensus in the SBML community was 
that more definitional power was warranted, so SBML Level 2 Version 1 introduced 
a fuller unit scheme. Arguably the one feature it lacked was a provision to allow unit 
definitions to be defined in terms of other defined units rather than solely in terms 
of the base units. The reason was our continuing attempts to make the unit scheme 
simple—after all, what use is it if many software tools don’t support it? But in the 
end, the consensus of users was that it should not be up to the SBML language to 
arbitrarily limit the capabilities in this area because it impacts a researcher’s ability 
to represent their intentions precisely. The SBML community felt that tools that 
lack adequate support for units should either be enhanced appropriately, or else 
that unit manipulation functionality could be encoded in separate software tools 
and libraries such as libSBML (Section 17.5). 
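
To make the idea of nested unit definitions concrete, the sketch below expands a definition written in terms of other definitions into base units. It is a minimal illustration under stated assumptions, not the SBML unit scheme or libSBML's implementation: the types UnitFactor, UnitTable, and FlatUnit and the function flatten are hypothetical, and cyclic definitions are not handled.

```cpp
#include <cmath>
#include <map>
#include <string>
#include <vector>

// One factor of a unit definition: (multiplier * ref)^exponent.
// `ref` names either a base unit ("mole", "litre", ...) or another
// entry in the table of definitions.
struct UnitFactor {
    std::string ref;
    int exponent;
    double multiplier;
};

using UnitTable = std::map<std::string, std::vector<UnitFactor>>;

// A definition reduced to base units: overall scale and per-base exponents.
struct FlatUnit {
    double scale = 1.0;
    std::map<std::string, int> exponents;
};

// Recursively expand `name` into base units. Names absent from the table
// are treated as base units. Cyclic definitions are not detected here.
FlatUnit flatten(const UnitTable& table, const std::string& name) {
    FlatUnit result;
    auto it = table.find(name);
    if (it == table.end()) {          // base unit
        result.exponents[name] = 1;
        return result;
    }
    for (const UnitFactor& f : it->second) {
        FlatUnit inner = flatten(table, f.ref);
        result.scale *= std::pow(f.multiplier * inner.scale, f.exponent);
        for (const auto& kv : inner.exponents) {
            result.exponents[kv.first] += kv.second * f.exponent;
        }
    }
    return result;
}
```

For example, if “mmol” is defined as 10^-3 mole and “uL” as 10^-6 litre, a third definition written as mmol per uL flattens to mole^1 litre^-1 with a scale of about 1000.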


■ ConstraintRule. It is sometimes important to be able to express the idea that 
certain model conditions should hold true and that if, during a simulation, the conditions 
are violated, then the user should be alerted that the model is operating outside the 
assumptions made by the model’s author. Although SBML has always had facilities 
for expressing mathematical relationships between quantities, it lacked a provision 
for expressing these kinds of constraints. SBML Level 2 Version 2 will extend the 
types of SBML rules available to include ConstraintRule. This structure will allow 
the statement of mathematical expressions whose values evaluate to a Boolean value 
(true or false). If at some point in time during a time-course evaluation of the model, 
the expression evaluates to false, the constraint is not satisfied. The ConstraintRule 
will contain an optional note (in XHTML format) that can contain a message to be 
displayed to the user if the constraint expression evaluates to false. An example of 
the application of this rule would be to make explicit the assumptions of an Henri- 
Michaelis-Menten rate law about relative species concentrations between product 
and substrate as well as between enzyme and substrate. 
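
The behavior described in this item can be sketched as a Boolean-valued expression paired with a message and evaluated at each time point of a simulation. The C++ below is an assumption-laden illustration, not the SBML structure or any simulator's API: State, Constraint, and violated are hypothetical names.

```cpp
#include <functional>
#include <map>
#include <string>
#include <vector>

// Values of model variables at one time point, keyed by identifier.
using State = std::map<std::string, double>;

// Hypothetical stand-in for a constraint: a Boolean-valued expression
// plus the message to show when it evaluates to false.
struct Constraint {
    std::function<bool(const State&)> holds;
    std::string message;
};

// Return the messages of all constraints violated in the given state.
// The state is taken by const reference, so checking cannot modify it.
std::vector<std::string> violated(const std::vector<Constraint>& constraints,
                                  const State& state) {
    std::vector<std::string> messages;
    for (const Constraint& c : constraints) {
        if (!c.holds(state)) messages.push_back(c.message);
    }
    return messages;
}
```

A simulator could call violated after each integration step and display any returned messages; a constraint such as "enzyme concentration is below one tenth of substrate concentration" (the threshold here is an arbitrary choice for illustration) would then flag states in which a Henri-Michaelis-Menten rate law is being used outside its assumptions.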


17.4.3 SBML Level 3 


As a language that is an intersection rather than a union of features needed 
by all tools, SBML currently cannot support all the representational capabilities 
that all software systems offer to users. Some tools offer features that have no 
explicit equivalent in SBML Level 2, and those tools currently can only store those 
features as annotations in an SBML model. But in many cases those features could 
potentially be used by more than one tool, and thus it would be appropriate to 
have some representation for them in SBML. Using Level 2 as a starting point, the 
SBML community has been developing proposals and prototype implementations 
of many new capabilities that will become part of SBML Level 3. The main current 
areas of interest are: 


■ Diagram layout: enabling the inclusion of diagrammatic renditions of a model of 
the sort visible in the screenshots of figures 17.2 and 17.3 

■ Model composition: allowing construction of models from instances of submodels 

■ Multicomponent species: allowing species to be composed from instances of 
species types, enabling such things as the representation of complexes of 
phosphorylated proteins and generalized reactions acting on them 

■ Arrays: allowing models to contain indexed collections of objects of the same type 

■ Spatial features: allowing the representation of the geometric features of 
compartments, the diffusion rates of species and the spatial distribution of model 
parameters and boundary conditions 

■ Constraints: enabling the definition of constraints on model variables 


It is unreasonable to expect a tool to support every feature planned for Level 3 in 
order to be called Level 3 compatible. One of the challenges for SBML Level 3 will 
be to design a modular feature set. The idea is to enable a model to contain explicit 
information about which capabilities are necessary to interpret it correctly, so that 
tools encountering the model may reject it gracefully if they do not possess the 
necessary facilities. For reasons of efficiency and correctness, an explicit indication 
is preferable to requiring tools to read and interpret the entire model and inferring 
the capabilities needed. 

We anticipate that Level 3 will take the form of a core, consisting of minimal 
extensions to Level 2, and a set of Level 3 modules, each encapsulating the definition 
of one of the major features listed above. One of the extensions making up the 
Level 3 core will be explicit feature indicators, such that each of the modules has a 
corresponding feature tag which will appear in a list at the beginning of the model 
definition. The presence of a feature tag will signal to software tools reading the 
model that the model uses that particular feature. The software tool may then 
make a decision about whether it can handle the model or whether it should alert 
the user to a problem. 
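
The decision a tool would make on reading the feature list can be sketched in a few lines of C++. This is a hypothetical illustration only: Level 3 is not yet defined, so the function unsupportedFeatures and any tag names passed to it (for example "layout" or "arrays") are assumptions made for the example.

```cpp
#include <set>
#include <string>
#include <vector>

// Return the feature tags a model declares that the tool does not support.
// An empty result means the tool can go on to read the rest of the model.
std::vector<std::string> unsupportedFeatures(
        const std::vector<std::string>& declared,
        const std::set<std::string>& supported) {
    std::vector<std::string> missing;
    for (const std::string& tag : declared) {
        if (supported.count(tag) == 0) missing.push_back(tag);
    }
    return missing;
}
```

Because the check needs only the list at the beginning of the model, a tool can report the missing capabilities to the user without reading and interpreting the entire model first.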





17.5 Enabling Efficient and Correct Interpretation of SBML Using a Dedicated Software Library 


To make it easier for software developers and users to work with SBML, and more 
generally to promote the language’s use as a common exchange format, our group 
has released and continues to develop a number of open-source SBML software tools. 
Here we describe one, libSBML, that many projects are using for implementing 
support for SBML in their software applications. 


17.5.1 General Characteristics of libSBML 


LibSBML is an application programming interface (API) library for reading, writ- 
ing, and manipulating files and data streams containing SBML content. Developers 
can embed the library in their applications, saving themselves the work of imple- 
menting their own parsing, manipulation, and validation software. At the API level, 
the library provides the same interface to data structures independently of whether 
the model originated in SBML Level 1 or 2. The library currently also offers the 
ability to translate SBML Level 1 models to SBML Level 2. 

LibSBML is written in ISO standard C and C++ and is highly portable. It is 
currently supported on the Linux, Solaris, MacOS X, and Microsoft Windows oper- 
ating systems. The library provides language bindings for C, C++, Java, Python, 
Perl, MATLAB, and Common Lisp, with support for other languages planned for 
the future. We distribute the package in both source-code form and as precompiled 
dynamic libraries for the Microsoft Windows, Linux, and Apple MacOS X operating 
systems; they are available under terms of the LGPL (Free Software Foundation, 
1999) from the sbml project on SourceForge.net (SourceForge.net, 2002), the world’s 
largest open-source software repository and project hosting service. LibSBML is at 
release version 2.3.4 as of October 2005. 


17.5.2 Advantages of a Dedicated Library for SBML 


An often-repeated question is, why not simply use a generic XML parsing library? 
After all, SBML is usually expressed in XML, and there exist plenty of XML parsers, 
so why not simply tell people to use one of them, rather than develop a specialized 
library? The answer is: while it is true that developers can use general-purpose 
XML libraries, there are many reasons why using a system such as libSBML is a 
vastly better choice. 

One feature of libSBML is its set of facilities for manipulating mathematical 
formulas and for handling the differences in representation between SBML Level 1 and 
SBML Level 2. As discussed in more detail below, libSBML provides an API for 
working with formulas in both text-string and MathML (Ausbrooks et al., 2001) 
form, and for interconverting mathematical expressions between these forms. The utility 
of this facility extends beyond converting between SBML Level 1 and 2. Many 
software packages provide users with the ability to express formulas for such things 
as reaction rate expressions, and these packages’ interfaces often let users type in 
the formulas directly as text strings. LibSBML saves application programmers the 
work of developing formula manipulation and translation functionality. It makes 
it possible to translate those formula strings directly into Abstract Syntax Trees 
(ASTs), manipulate them using AST operations, and write them out in the MathML 
format of SBML Level 2. 

As discussed in Section 17.5.5, another feature of libSBML is the validation it 
performs on SBML inputs at the time of parsing files and data streams. This helps 
verify the correctness of models in a way that goes beyond simple syntactic valida- 
tion. Still another invaluable feature of libSBML is the domain-specific operations 
it provides beyond simple SBML-specific accessor facilities. Examples of such op- 
erations include obtaining a count of the number of boundary condition species, 
determining the modifier species of a reaction (assuming the reaction provides ki- 
netics), and constructing the stoichiometric matrix for all reactions in a model. 
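
As an illustration of the last of these operations, the sketch below builds a stoichiometric matrix from a list of reactions. It does not use or reproduce the libSBML API: the Reaction struct and the function stoichiometricMatrix are hypothetical, and stoichiometries are restricted to integers for brevity.

```cpp
#include <cstddef>
#include <map>
#include <string>
#include <vector>

// Hypothetical minimal reaction: species id -> stoichiometry.
struct Reaction {
    std::map<std::string, int> reactants;
    std::map<std::string, int> products;
};

// Build the stoichiometric matrix N (rows: species, columns: reactions),
// where N[i][j] is the net change of species i per occurrence of reaction j.
std::vector<std::vector<int>> stoichiometricMatrix(
        const std::vector<std::string>& species,
        const std::vector<Reaction>& reactions) {
    std::vector<std::vector<int>> n(species.size(),
                                    std::vector<int>(reactions.size(), 0));
    for (std::size_t i = 0; i < species.size(); ++i) {
        for (std::size_t j = 0; j < reactions.size(); ++j) {
            auto r = reactions[j].reactants.find(species[i]);
            if (r != reactions[j].reactants.end()) n[i][j] -= r->second;
            auto p = reactions[j].products.find(species[i]);
            if (p != reactions[j].products.end()) n[i][j] += p->second;
        }
    }
    return n;
}
```

For the two-step scheme E + S → ES followed by ES → E + P, the row for ES is (+1, −1) and the row for E is (−1, +1), which reflects that the enzyme is consumed in the first step and regenerated in the second.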

Finally, libSBML is solidly written and tested. The entire library has been written 
by seasoned, professional software engineers using the test-driven approach (Beck, 
2002). The libSBML source code currently has 760 unit tests and over 3,400 
individual assertions. It represents a robust and well-tested system that others can 
build upon. 


17.5.3 Manipulating Mathematical Formulas 


In SBML Level 1, mathematical formulas are represented as text strings using a 
C-like syntax. We chose this representation because of its simplicity, widespread fa- 
miliarity, and use in applications such as Gepasi (Mendes, 1997) and Jarnac (Sauro, 
2000), whose authors contributed to the initial design of SBML. For SBML Level 2, 
there was a need to expand the mathematical vocabulary of Level 1 to include 
additional functions (both built-in and user-defined), mathematical constants, log- 
ical operators, relational operators, and a special symbol to represent time. Rather 
than growing the simple C-like syntax into something more complicated and 
esoteric in order to support these features, and consequently having to manage two 
standards in two different formats (XML and text string formulas), we chose to 
leverage an existing standard for expressing mathematical formulas in Level 2: the 
content portion of MathML (Ausbrooks et al., 2001). 

Using MathML in SBML has at least two advantages. First, instead of reinventing 
the wheel, we are building upon an existing and well-established W3C standard. 
Second, since the entirety of a model is expressed in XML, SBML is now more 
amenable to tools that can process, manipulate, and store XML, such as (for 
example) XSLT (Clark and DeRose, 1999), XQuery (Fernández et al., 2005), 
XPath (Fernández et al., 2005), and other XML technologies. That said, there are 
some disadvantages to using MathML. By introducing MathML part-way through 
the evolution of SBML, we have created a legacy support problem by having 
two formula representations with which to contend and interconvert. Also, most 
simulator packages cannot parse and understand MathML directly (but, we should 
point out the same would hold true had we chosen to expand the lowest-common- 
denominator C-like syntax of Level 1). Overcoming both of these disadvantages is 
easy with libSBML. 

Abstract Syntax Trees (ASTs) are well-known in the computer science commu- 
nity; they are simple recursive data structures useful for representing the syntactic 
structure of sentences in certain kinds of languages (mathematical or otherwise). 
Much as libSBML allows programmers to manipulate SBML at the level of domain- 
specific objects, regardless of SBML level or version, it also allows programmers to 
work with mathematical formulas at the level of ASTs regardless of whether the 
original format was C-like infix notation or MathML. LibSBML goes one step fur- 
ther by allowing programmers to work exclusively with infix formula strings and 
instantly convert them to the appropriate MathML whenever needed. 

LibSBML ASTs provide a canonical, in-memory representation for all mathe- 
matical formulas regardless of their original format (that is, C-like infix strings or 
MathML). In libSBML, an AST is a collection of one or more ASTNodes. ASTNodes 
represent the most basic, indivisible part of a mathematical formula and come in 
many types. For instance, there are node types to represent numbers (with subtypes 
to distinguish integer, real, and rational numbers), names (for example, constants 
or variables), simple mathematical operators, logical or relational operators, and 
functions. Each ASTNode may have zero, one, two, or more child ASTNodes 
depending on its type. For instance, table 17.1 illustrates how the mathematical 
expression 1 + 2 is represented as an AST with one plus node with two integer child 
nodes for the numbers 1 and 2, and gives the corresponding MathML representation. 
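
The idea of a single in-memory tree with two textual renderings can be sketched as follows. This is a deliberately tiny stand-in, not libSBML's ASTNode class: the Node struct, the helpers makeInteger and makePlus, and the functions toInfix and toMathML are hypothetical, and only integers and the plus operator are supported.

```cpp
#include <cstddef>
#include <memory>
#include <string>
#include <vector>

// Hypothetical, much-reduced AST node: only integers and the '+' operator.
struct Node {
    enum Kind { Integer, Plus } kind;
    long value;                                   // used when kind == Integer
    std::vector<std::shared_ptr<Node>> children;  // used when kind == Plus
};

std::shared_ptr<Node> makeInteger(long v) {
    return std::make_shared<Node>(Node{Node::Integer, v, {}});
}

std::shared_ptr<Node> makePlus(std::shared_ptr<Node> a, std::shared_ptr<Node> b) {
    return std::make_shared<Node>(Node{Node::Plus, 0, {a, b}});
}

// Render the tree as a C-like infix string.
std::string toInfix(const Node& n) {
    if (n.kind == Node::Integer) return std::to_string(n.value);
    std::string out;
    for (std::size_t i = 0; i < n.children.size(); ++i) {
        if (i > 0) out += " + ";
        out += toInfix(*n.children[i]);
    }
    return out;
}

// Render the tree as MathML content markup (without the enclosing <math>).
std::string toMathML(const Node& n) {
    if (n.kind == Node::Integer)
        return "<cn type=\"integer\"> " + std::to_string(n.value) + " </cn>";
    std::string out = "<apply><plus/>";
    for (const auto& child : n.children) out += toMathML(*child);
    return out + "</apply>";
}
```

Building the tree with makePlus(makeInteger(1), makeInteger(2)) and passing it to either function yields the infix string or the MathML fragment from the same structure, which is the property that lets an application stay indifferent to the original format.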


Infix     AST          MathML 

1 + 2       +          <math xmlns="http://www.w3.org/1998/Math/MathML"> 
           / \           <apply> 
          1   2            <plus/> 
                           <cn type="integer"> 1 </cn> 
                           <cn type="integer"> 2 </cn> 
                         </apply> 
                       </math> 

Table 17.1 Illustration of a simple mathematical expression represented in both 
libSBML’s AST structure and MathML (Ausbrooks et al., 2001). 


17.5.4 Performance of LibSBML 


XML parsers come in two popular varieties: Document Object Model (DOM) 
based and event-based. DOMs (Le Hors et al., 2000) are very generic in-memory 
structures that nearly duplicate the tree-like structure of the XML on disk. Using 
a DOM simply sweeps the parsing problem under the rug. Instead of parsing a file, 
one now has to parse an in-memory data structure. Moreover, because DOMs are 
generic, needing to handle any XML that comes their way, one pays a penalty in 
terms of large memory consumption. Event-based parsers, on the other hand, allow 
programmers to intercept specific XML events (tags) and act on them. Event-based 
parsers are memory-efficient, but are often too low-level and fine-grained. They 
therefore lack the convenience of manipulating XML data in larger logical units. 
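
The event-based style can be illustrated with a toy scanner that reports each start tag to a callback and keeps nothing else in memory. This is not a real XML parser and not how libSBML or Xerces-C++ work internally: the functions forEachStartTag and countElements are hypothetical, and comments, CDATA sections, and attribute values containing '<' are not handled.

```cpp
#include <cstddef>
#include <functional>
#include <string>

// Call `onStart(name)` for every start tag in `xml`, in document order,
// without building any tree. Deliberately naive: comments, CDATA, and
// attribute values containing '<' are not handled.
void forEachStartTag(const std::string& xml,
                     const std::function<void(const std::string&)>& onStart) {
    std::size_t pos = 0;
    while ((pos = xml.find('<', pos)) != std::string::npos) {
        ++pos;
        if (pos >= xml.size()) break;
        char c = xml[pos];
        if (c == '/' || c == '?' || c == '!') continue;  // end tag, PI, declaration
        std::size_t end = xml.find_first_of(" \t\r\n/>", pos);
        if (end == std::string::npos) break;
        onStart(xml.substr(pos, end - pos));
        pos = end;
    }
}

// Count elements with the given name using only start-tag events.
int countElements(const std::string& xml, const std::string& name) {
    int count = 0;
    forEachStartTag(xml, [&](const std::string& tag) {
        if (tag == name) ++count;
    });
    return count;
}
```

The sketch shows both sides of the trade-off: counting elements needs only one integer of state, but anything that depends on the relationship between elements has to be tracked by hand in the callback.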

LibSBML aims to strike a balance between DOM and event-based models of 
XML parsing. It provides the conveniences of a domain-specific object model while 
keeping memory usage to a minimum. Below, we compare the performance of 
libSBML, which uses the Xerces-C++ (Apache Software Foundation, 2004) event-based 
SAX parser under the hood, to parsing SBML with the Xerces-C++ DOM. 

We obtained memory consumption statistics by writing two simple programs to 
read an SBML model from file into memory. One program used libSBML to read the 
model into domain-specific SBML objects and the other program used Xerces-C++ 
2.6 to read the model into the W3C XML DOM format (Le Hors et al., 2000). Each 
program recorded its total resident memory consumption immediately before and 
after reading the model and reported the difference between these two numbers. 

Total resident memory gives an estimate not only of the size of the model in 
memory, but also the size of the library and all supporting code that must be 
loaded into memory (often of concern to programmers). LibSBML was compiled 
with Xerces 2.6, so the amount of memory consumed by the Xerces library itself is 
the same for both programs. 

We ran both programs over the 10,000+ models in the SBML Test Suite (SBML 
Team, 2005a) and models used in the first SBML Hackathon. Individual file sizes 
varied from 600 bytes to 5.76 MBytes. The runs were performed on computers 
running SuSE Linux 9.1 (Novell, Inc., 2005) with dual 64-bit AMD Opteron 2.2 GHz 
processors (Advanced Micro Devices, Inc., 2005). 

Figure 17.5 shows a plot of the file size on disk versus the object model size 
in memory. While the Xerces-C++ 2.6 DOM is more efficient than previous 
implementations, the DOM consumed nearly five times as much memory for large 
multi-megabyte files. For small files (under five kilobytes), the DOM is ever so 
slightly more efficient. This is likely because Xerces uses string pooling and other 
reference counting techniques to optimize memory usage. For SBML files larger 
than five kilobytes and especially files larger than one megabyte, libSBML is the 
clear performance winner. 


Figure 17.5 A plot of memory consumption by libSBML (solid line) and the Xerces-C++ 
Document Object Model (dotted line), when each is used to read SBML models into 
computer memory. Data are based on over 10,000 sample models taken from the SBML 
Semantic Validation Suite and the first SBML Hackathon of 2003. File sizes (horizontal 
axis) varied from 600 bytes to 5.76 MBytes. 


17.5.5 Helping Ensure Correctness and Consistency 


Syntactic validation involves verifying that the SBML input is well-formed, and, for 
example, that data values are of the correct types. Consistency checking involves 
verifying the contents of an SBML model for self-consistency, referential integrity, 
and adherence to the SBML specifications. The tests are implemented as individual 
constraints within libSBML; the library reports validation failures back to the calling 
application via the libSBML API. The constraint checking system is modular, 
and the constraint set can be easily extended. We describe the design and intent of 
the constraint syntax below. 

The design of SBML is driven by data models instead of the specifics of XML 
representation. To that end, the SBML specification is first described using UML 
static class diagrams. These class descriptions are mapped to XML representations 
using SCHUCS (Hucka, 2000), a technique we developed, tailored to producing effi- 
cient, reasonably succinct, and quasi-human-readable XML. We wanted to parallel 
our emphasis on data over representation, with a declarative language to express 
SBML model constraints (declarative languages state the what without specifying 
the how). For this, we took inspiration from the UML community and its development 
of OCL, the Object Constraint Language (Object Management Group, Inc., 
2002; Warmer and Kleppe, 2003). 

Although libSBML consistency checks are not expressed directly in OCL, we have 
created an OCL-like language on top of the libSBML C++ API. This language 
balances the readability of OCL with the efficiency and expressiveness of C++, 
which is sometimes necessary for more complicated validation procedures. The 
language allows the manipulation of only constant C++ objects, which, much 
like OCL, guarantees that operations will be side-effect free. Further, this guarantee 
is enforced at compile time. Being side-effect free is an important property as we 
do not want the process of consistency checking to change the state of a model. An 
example will help make these concepts more clear. 

One of the 50 consistency checks currently implemented ensures that if a model 
author overrides the default definition of the substance unit, a special unit name 
in SBML, the resulting unit definition is consistent with the notion of a substance. 
The consistency check constraint is written as: 


START_CONSTRAINT (1202, UnitDefinition, ud) 
{ 
  msg = 
    "A 'substance' UnitDefinition must simplify to a single " 
    "Unit of kind 'mole' or 'item' with an exponent of '1' " 
    "(L2v1 Section 4.4.3)."; 

  pre( ud.getId() == "substance" ); 

  inv( ud.getNumUnits() == 1 ); 
  inv( ud.getUnit(0).isMole() || ud.getUnit(0).isItem() ); 
  inv( ud.getUnit(0).getExponent() == 1 ); 
} 
END_CONSTRAINT 


The START_CONSTRAINT macro takes three arguments. The first is a number that 
uniquely identifies this constraint (that is, 1202). Assigning such identifiers to each 
constraint facilitates traceability and allows programmers to easily determine which 
rules have been violated. The next two parameters indicate the type of SBML object 
to which this rule applies (that is, UnitDefinition) and a shorthand name to use 
for the object being checked (that is, ud). 

The body of the constraint consists of a message (msg) to be logged should the 
SBML object fail the check. After the message, zero or more preconditions (pre) 
may be listed. In order for the rule to apply to the SBML object in question, all 
preconditions must hold (in the order listed). If a precondition does not hold, the 
check is aborted without logging either a pass or a failure. Finally, assuming all 
preconditions hold, the object’s state must adhere to a set of one or more invariants 
(inv). Should any invariant fail, the constraint immediately fails and a message is 
logged. 
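
The control flow of preconditions and invariants described above can be sketched in plain C++. This illustrates the semantics only and is not libSBML's macro implementation: the macros PRE and INV, the Outcome enum, and the reduced Unit and UnitDefinition structs are all hypothetical.

```cpp
#include <string>
#include <vector>

// Reduced stand-ins for the SBML objects involved in the check.
struct Unit {
    std::string kind;
    int exponent;
};

struct UnitDefinition {
    std::string id;
    std::vector<Unit> units;
};

enum class Outcome { NotApplicable, Passed, Failed };

// Hypothetical helper macros: a failed precondition means the rule does not
// apply; a failed invariant means the object violates the rule.
#define PRE(expr) do { if (!(expr)) return Outcome::NotApplicable; } while (0)
#define INV(expr) do { if (!(expr)) return Outcome::Failed; } while (0)

// The object is taken by const reference, so the check cannot modify it;
// the compiler rejects any attempt to change it through `ud`.
Outcome checkSubstanceUnit(const UnitDefinition& ud) {
    PRE( ud.id == "substance" );
    INV( ud.units.size() == 1 );
    INV( ud.units[0].kind == "mole" || ud.units[0].kind == "item" );
    INV( ud.units[0].exponent == 1 );
    return Outcome::Passed;
}
```

Evaluating the statements in order matters here: the first invariant guarantees that units[0] exists before the later invariants read it, which mirrors the ordering of the invariants in constraint 1202.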

In the above example, notice that preconditions and invariants are specified on 
the (lib)SBML object model. Each method invocation (operation) does not change 
the state of the model and specifies what not how (with apologies made for the 
standard names used for getter methods, for example, getUnit(), which arguably 
describes how and not what; even OCL falls victim to this slight, purely esthetic 
inconsistency.) 

Finally, it’s worth describing a case where OCL-like statements are not enough 
and having the full expressive power of C++ to write rules is advantageous. In 
SBML, compartments may be nested inside one another, with the limitation that 
this nesting may not be cyclic (an example of a cycle: compartment A is in B which 
is in C which is in A). While it is relatively easy to encode this constraint in the 
OCL-like language demonstrated above, reporting a user-friendly error message is 
another matter. Upon violation of this constraint, instead of simply stating that a 
cycle exists, it is better to indicate the chain of compartments that was followed 
to detect this cycle, thereby enabling the model author to quickly track down the 
cause of the error. Constructing such an informative error message is awkward in a 
purely declarative language like OCL. However, in C++, with its built-in Standard 
Template Library (STL) strings, sets, and the ability to iterate over collections, 
constructing an informative error message is straightforward. 
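
As a rough sketch of why this is straightforward in C++, the function below follows the containment links from one compartment and returns the chain it walked if it revisits a compartment. The map-based representation (OutsideMap) and the name findCycle are assumptions made for this illustration; they are not the libSBML object model or its actual implementation of the rule.

```cpp
#include <cassert>
#include <map>
#include <set>
#include <string>

// Maps each compartment id to the id of the compartment that contains it.
// Top-level compartments have no entry. (Hypothetical representation.)
using OutsideMap = std::map<std::string, std::string>;

// Follow the containment links starting at `start`. If a compartment is
// reached twice, return the chain that was followed; otherwise return "".
std::string findCycle(const OutsideMap& outside, const std::string& start) {
    std::set<std::string> visited;
    std::string chain = start;
    std::string current = start;
    visited.insert(start);

    for (auto it = outside.find(current); it != outside.end();
         it = outside.find(current)) {
        current = it->second;
        chain += " -> " + current;
        if (!visited.insert(current).second) {
            return chain;  // `current` is already on the chain: cycle found
        }
    }
    return "";  // reached a top-level compartment without repeating
}
```

For the cycle given in the text (A is in B, which is in C, which is in A), calling findCycle with start "A" returns "A -> B -> C -> A", which can be appended directly to the logged message.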


17.5.6 Open-Source Development 


We note with satisfaction that the open-source model of software development has 
been yielding dividends for libSBML. The user community has contributed not only 
several bug fixes, but new code as well. These include: support for the Expat parser 
library (Drake and Clark, 2005), a full Perl API, a full Lisp API, and an extension 
to support the use of a provisional SBML standard for storing model diagrams (see 
Section 17.4.3). 

The libSBML open-source license allows it to be incorporated freely into other 
programs in whole or part. Several simulator programs and projects developed in 
academia already make use of libSBML to support both SBML import and export. 
Such simulator programs include: Gepasi (Mendes, 1997), COPASI (Mendes, 2003), 
Jarnac (Sauro, 2000), and the DARPA Bio-SPICE project (Kumar and Feidler, 
2003). It is worth mentioning that since libSBML is distributed under the terms of 
the GNU Lesser General Public License (LGPL) it may also be used without restriction in 
commercial applications (Free Software Foundation, 1999). We currently know of 
two commercial software applications using libSBML. 





17.6 Validating Application Behavior 


When we first developed SBML, we expected that most of the difficulties faced by 
developers in implementing software support would stem from issues of constructing 
and parsing valid model structures. We knew it would be impossible to write 
perfectly clear specifications for the language, but we expected that once issues of 
ambiguities and other problems in SBML’s definition were overcome, interchange of 
models between software tools would naturally follow. And to a surprising extent, 
this was true for a few early applications such as Jarnac and Gepasi, exactly the 
same applications that informed the definition of SBML in the first place. It was not 
until a large number of other software developers began working with SBML that it 
became clear the community faced more subtle issues of model interpretation and 
consensual agreement about expected behaviors of simulation tools. 


17.6.1 Types of Validation 


At the highest level, we can partition the question of validity into two main 
categories: 


1. Syntactic: does the software accept well-formed SBML input, and reject all 
syntactically invalid SBML input? (Note that a software package may reject some 
valid SBML inputs because it detects the presence of constructs it is not designed 
to handle. For the purposes of syntactic verification, such behavior is acceptable 
and presumably can be distinguished from a failure to accept well-formed SBML.) 


2. Semantic: does the software interpret well-formed SBML correctly? This can be 
further divided: 


(a) Model structure: does the software construct the correct model structure 
based on the SBML input, independent of what it does with that structure? 


(b) Model behavior: does the software correctly interpret or generate the 
intended model behavior? 


The difference between the two types of semantic validation is about structure 
versus dynamics. Going beyond verification of conformance to SBML syntax, the 
semantic interpretation of a model involves both creating the intended constructs 
based on the SBML and analyzing or simulating the model in the intended way. In 
both cases, correctness is something that has to be carefully specified. 

Some models can only be evaluated based on their structure. For example, 
molecular interaction models may not contain any kinetic information, so it is not 
clear that there is a definable model behavior per se. In that case, the model may 
be only evaluable based on the model structure. Other models have dynamics, 
and software tools can be evaluated based on whether they produce agreed-upon 
simulation or analysis results. 


17.6.2 A Problem Not Addressed by Definitions Alone 


The problem of achieving “agreed-upon simulation and analysis results” goes deeper 
than stipulating the required syntactic and semantic aspects of SBML and providing 
model structure-based verification of consistency of the sort now available in 
libSBML (Section 17.5.5). At least two issues must be addressed. One is the problem 
of reaching a consensus in a community about how to interpret different classes 
of models. This is a problem of education and communication, which in the case 
of SBML is being helped tremendously by the biannual SBML face-to-face events 
(SBML Forums and Hackathons). A second problem is providing a way for software 
developers to verify the behaviors of their software tools vis-a-vis the consensus view 
of simulator behaviors. This requires testing the behavior of software that interprets 
and manipulates models encoded in SBML. 

To help address this latter problem, we have recently introduced the first version 
of the SBML Semantic Validation Suite, described in the next section. 


17.6.3 The SBML Semantic Validation Suite 


The Semantic Validation Suite consists of (1) a set of valid SBML models each with 
representative, simulated time-course data, and (2) a scripted, automated testing 
framework for running software tools through the suite. This suite is designed to 
be used by software developers to check that their simulators produce results that 
are consistent with the SBML standard and thus with each other. 

In the general case, verifying the interpretation of SBML by an arbitrary software 
package is an extremely challenging problem, since different applications use models 
in different ways, generate different types of outputs, and provide different user 
interfaces. The only realistic way to approach this problem systematically is to 
tackle different application types separately, treating ODE-based simulators as one 
type, stochastic simulators as another, pathway analysis tools as another, and so on. 
We chose to develop tests for ODE-based simulators first because: (a) this kind 
of simulation software makes up a significant proportion of the applications that 
support SBML; (b) simulation is one of the more complex types of analysis that 
can be applied to SBML; and (c) apart from metadata, almost all SBML features 
impact a model’s behavior in simulation. 

The set of models in the SBML Semantic Validation Suite is still incomplete, 
but the current version covers the majority of SBML features. The suite is divided 
into categories of tests, where each category deals with a set of related features of 
SBML. The scripts in the suite allow a simulator to be tested systematically against 
the test set. Each test in the suite comes with: (1) the correct simulation output 
in a consistent documented format; (2) plots of correct simulation output; and (3) 
documentation for the test. The beta version of the test suite was announced in 
October 2004. Several developers have begun using the suite as part of their work 
and communicating feedback to us about the suite itself; this feedback process is 
helping us to improve every aspect of it. 
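
To make the idea of checking a simulator against reference time-course data concrete, the following is a minimal sketch of one way such a comparison could be written. It is not the suite's actual testing framework: the Row type, the function matchesReference, and the combined absolute/relative tolerance rule are all assumptions chosen for illustration.

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#include <vector>

// One row of time-course output: a time point and one value per species.
struct Row {
    double time;
    std::vector<double> values;
};

// Return true when `result` matches `reference` row by row: same shape, and
// every number within absTol + relTol * |reference value| (assumed rule).
bool matchesReference(const std::vector<Row>& result,
                      const std::vector<Row>& reference,
                      double absTol, double relTol) {
    auto close = [&](double got, double want) {
        return std::fabs(got - want) <= absTol + relTol * std::fabs(want);
    };
    if (result.size() != reference.size()) return false;
    for (std::size_t r = 0; r < reference.size(); ++r) {
        if (!close(result[r].time, reference[r].time)) return false;
        if (result[r].values.size() != reference[r].values.size()) return false;
        for (std::size_t s = 0; s < reference[r].values.size(); ++s) {
            if (!close(result[r].values[s], reference[r].values[s])) return false;
        }
    }
    return true;
}
```

A tolerance is needed because different numerical integrators will not reproduce the reference values digit for digit; suitable values depend on the model and are left to the caller here.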

Our long-term goal in this effort is to eventually produce a highly automated 
software evaluation system. We hope to be able to generate an in-depth guide that 
categorizes different tools along different dimensions related to their purposes and 
coverage of SBML features. This will be an important aid both to potential users 
(who will be able to easily compare the functionality of different software packages) 
and to developers (who will be able to use the evaluation tools to help guide their 
implementation of SBML support during software development). We also believe 
the content will be useful for researchers wishing to understand SBML on its own. 


17.7 Summary 


Computational modeling is becoming crucial for making sense of the vast quantities 
of complex experimental data that are now being collected. The systems biology 
community needs agreed-upon information standards if models are to be shared, 
evaluated, and developed cooperatively. The Systems Biology Markup Language 
(SBML) is an XML-based format for representing computational models in a way 
that can be used by different software systems to communicate and exchange those 
models. It is supported today by over 80 software tools worldwide and a vibrant 
community of modelers and software authors. A variety of resources are available for 
working with SBML; there is also an Internet MIME type defined for SBML (Kovitz, 
2004) and a new public database of models based around SBML (BioModels Team, 
2005). 

In support of SBML and its community, we continue to develop and make available 
software infrastructure, including programming libraries, conversion utilities, 
interface packages for commonly used software environments, and easy-to-access 
online tools. All of our software development follows the open-source tradition to 
maximize the accessibility and utility of the products. 

The success of SBML has led to requests from the community for new features 
and continued evolution of the language. We view our role as organizers and editors 
in the development and evolution of SBML; the process is open and crucially 
dependent on the involvement of others in the computational modeling field. We 
invite interested individuals and groups to join the SBML Forum, the informal 
community of SBML users and developers, to participate in the process and help 
us improve SBML and its capacity for acting as a common exchange format for 
computational modeling software in systems biology. Information on this and other 
aspects of the SBML project is available on the project Web site (SBML Team, 
2005b). 


A Software Tools for Biological Modeling 


The software tools favored by the contributors to this book, all active researchers 
in biological modeling, are listed in this chapter. The selection is eclectic and 
practical, and is presented here more as a guide to help intrepid readers get their 
feet wet than as a complete list of canonical tools. We apologize for any omissions 
but note that any tool used frequently in publications will not want for users. Keep 
in mind that advances in software occur faster than advances in science. It follows 
that tools will either evolve in sophistication or users will migrate upwards. For 
users, open standards (see chapter 17) for model interchange are therefore crucial 
to avoid being in thrall to an outdated program. 





A.1 Genetic Network Analyzer: GNA 


■ Description: GUI with network visualization, model editor, and visualization of 
simulation results (de Jong et al., 2003b) 
■ System requirements: Java, runs under Windows, Unix, Solaris, MacOS 
■ Features: Qualitative analysis: modeling, simulation, and analysis of genetic 
regulatory networks described by piecewise-linear differential equation models 
supplemented by parameter inequality constraints 
■ Website: http://www-helix.inrialpes.fr/gna 





A.2 Gene Interaction Network Simulator: GINsim 


■ Description: GUI with network visualization, model editor, and visualization of 
simulation results (Chaouiya et al., 2003) 
■ System requirements: Java, runs under Windows, Unix, Solaris, MacOS 
■ Features: Qualitative analysis: modeling, simulation, and analysis of genetic 
regulatory networks described by discrete, logical models 
■ Website: http://www.esil.univ-mrs.fr/~chaouiya/GINsim 







A.3 Discrete Dynamics Lab: DDLab 


■ Description: GUI with model editor and visualization of network dynamics 
(Wuensche, 2003) 
■ System requirements: Written in C, runs under DOS, Unix, Linux, Irix 
■ Features: Tools for researching cellular automata, random Boolean networks, 
multi-value discrete dynamical networks 
■ Website: http://www.ddlab.com 





A.4 Cellerator 


■ Description: Cell model generation and simulation, from reaction descriptions, 
within a powerful computer algebra system (Shapiro et al., 2003) 
■ System requirements: Mathematica package 
■ Features: Quick, easy model construction with palette; ODEs shown and solved; 
luxuriously supports the power math user; extensible: biologists can add new 
reaction types (e.g., kMech add-on package for enzyme kinetics) 
■ Website: http://www.igb.uci.edu/servers/sb.html 





A.5 Sigmoid 


■ Description: Pathway modeling database and web pathway simulation environment 
(Cheng et al., 2005) 
■ System requirements: Java (1.4+), runs under Windows, Unix, Solaris, MacOS 
■ Features: Web GUI access to Cellerator and pathway model database; scalability 
in organizing the great variety of biological mechanisms; flexible mapping from 
“biological reaction type hierarchy” to “mathematical reaction model type hierarchy”; 
UML specification of reaction types and reactant types 
■ Website: http://www.sigmoid.org 





A.6 Metatool 


■ Description: Structural network analysis for studying metabolic networks (Pfeiffer 
et al., 1999) 
■ System requirements: Java, runs under Windows, Unix, Solaris, MacOS 
■ Features: Conservation relations; null space analysis; calculation of elementary 
modes 
■ Website: http://pgrc-03.ipk-gatersleben.de/tools/phpMetatool/index.php 





A.7 FluxAnalyzer 


■ Description: Structural network analysis completely embedded in a GUI with 
(optional) network visualization (interactive flux maps) (Klamt et al., 2003) 
■ System requirements: Matlab 
■ Features: Calculation of graph-theoretical path lengths and network diameter; 
null space analysis; conservation relations; metabolic flux analysis; flux balance 
analysis; calculation and detailed analysis of elementary modes and extreme 
pathways 
■ Website: http://www.mpi-magdeburg.mpg.de/projects/fluxanalyzer 





A.8 ScrumPy 


■ Description: Simulator for general biochemical systems (Poolman et al., 2003) 
■ System requirements: Python, mixture of command-line tools and GUIs 
■ Features: Conservation relations; null space analysis; calculation of elementary 
modes 
■ Website: http://bms-mudshark.brookes.ac.uk/ScrumPy 





A.9 Jarnac 


■ Description: Simulator for general biochemical systems (Sauro, 2000) 
■ System requirements: Windows 95/98, NT, 2000 
■ Features: Jarnac is a language for describing and manipulating cellular system 
models and can be used to describe metabolic, signal transduction, and gene 
networks, or in fact any physical system which can be described in terms of a 
network and associated flows. 
■ Website: http://www.cds.caltech.edu/~hsauro/Jarnac.htm 





A.10 Gepasi 


■ Description: GUI simulator for general biochemical systems (Mendes, 1997) 
■ System requirements: Windows 95 and up; Linux under Wine 
■ Features: Gepasi is a software package for modeling biochemical systems. It 
simulates the kinetics of systems of biochemical reactions and provides a number of 
tools to fit models to data, optimize any function of the model, perform metabolic 
control analysis and linear stability analysis. 
■ Website: http://www.gepasi.org 





A.11 MesoRD 


■ Description: MesoRD is a tool for stochastic and deterministic simulation of 
reaction-diffusion systems. Reads SBML model descriptions. (Hattne et al., 2005) 
■ System requirements: Linux, Mac OS X, NetBSD, Solaris and Windows XP 
■ Features: Implements the next subvolume method; explicit unit handling; 
constructive solid geometry is used for compartment geometry descriptions; MathML 
reaction rate expressions are automatically restructured for fast evaluation; 
evaluated reaction rates are hashed; licensed under the GNU GPL. 
■ Website: http://mesord.sourceforge.net 





A.12 Ingeneue 


■ Description: Genetic network construction software (Meir et al., 2002) 
■ System requirements: Java, runs under Windows, Unix, Solaris, MacOS 
■ Features: Ingeneue is a general-purpose program designed to construct and 
analyze models of genetic networks, designed so that it can be used by a biologist 
with only a minimal amount of mathematical training. 
■ Website: http://ingeneue.org 





A.13 XPPAUT 


■ Description: Simulation and exploration of models of dynamical systems 
(Ermentrout, 2002) 
■ System requirements: All platforms 
■ Features: XPPAUT is a program designed specifically for the needs of dynamical 
systems. It has many options for integrators and numerical algorithms and includes 
Auto for simple bifurcation continuations. It has a simple file format for the input 
of models and versatile graphing capabilities. 
■ Website: http://www.math.pitt.edu/~bard/xpp/xpp.html 


A.14 BioSens 


■ Description: GUI for methods to identify cellular architecture and dynamics from 
experimental data (Taylor et al., 2005) 
■ System requirements: Windows, partial installation on Linux using XPP 
■ Features: Dynamical sensitivity analysis; Fisher information matrix; FIM-based 
measurement selection 
■ Website: http://www.chemengr.ucsb.edu/~ceweb/faculty/doyle/biosens/BioSens.htm 





A.15 JigCell 


■ Description: Building models, simulation, comparison to experimental data, 
parameter estimation (Vass et al., 2004) 
■ System requirements: Java, runs under Windows, Unix, Solaris, MacOS 
■ Features: SBML input 
■ Website: http://jigcell.biol.vt.edu 





A.16 Oscill8 


■ Description: Simulation and advanced bifurcation analysis 
■ System requirements: Windows, Linux, Mac OS X 
■ Features: Oscill8 is a suite of tools for analyzing large systems of ODEs, 
particularly with respect to understanding how the high dimensional parameter space 
controls the dynamics of the system. 
■ Website: http://oscill8.sourceforge.net/ 





A.17 Madonna 


■ Description: Simulation, sensitivity analysis, optimization 
■ System requirements: Windows, Mac OS X 
■ Features: Berkeley Madonna is a general purpose differential equation solver 
for the modeling and analysis of dynamical systems. Developed on the Berkeley 
campus under the sponsorship of NSF and NIH, it is currently used for constructing 
mathematical models for research and teaching. 
■ Website: http://www.berkeleymadonna.com/ 







A.18 Systems Biology Workbench 


■ Description: General frameworks for computational modules: Systems Biology 
Workbench, Matlab, Mathematica, Maple, Scilab, Octave 
■ System requirements: All operating systems 
■ Features: The Systems Biology Workbench is software that uses SBML (chapter 17) 
to allow communications between diverse software modules. A host of software 
packages are compatible with SBW (http://www.sys-bio.org). Maple, Mathematica, 
Matlab, Octave, and Scilab are general purpose mathematical analysis 
software packages, with Maple and Mathematica more adept at algebraic 
manipulations and Matlab and Scilab more adept at numerical computations. Octave, 
Scilab, and the Systems Biology Workbench are free for use while the others are 
commercial. 
■ Website: http://sbml.org/index.psp 